
UNSUPERVISED BUILDING AND EXPLOITATION OF COMPOSITE 

DESCRIPTORS 

Field of the Invention 

The present invention relates to sequences of symbols and, more particularly, to
unsupervised building and exploitation of composite descriptors.

Background of the Invention 

Sequences of symbols are useful in a number of areas. One such area is DNA. 
DNA (deoxyribonucleic acid) may be described through a long sequence of symbols.
DNA is commonly described through the characters A, G, C, or T. These characters may 
be thought of as the alphabet of DNA. Another area where sequences of symbols are 
important is proteins. Proteins are sequences of amino acids, where each amino acid can 
be described by a character or letter. The "alphabet" of amino acids comprises the
characters of A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. Sequences
of symbols are also important in encryption and coding. For example, computers 
commonly store character data in numeric format. For instance, the word "the" could be 
coded in the American Standard Code for Information Interchange (ASCII) format as 
decimal symbols 116, 104, and 101. Encryption schemes change these numbers to
conceal the underlying information.
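By way of illustration only, the numeric coding mentioned above can be sketched in a few lines of Python. The function names and the additive "shift" scheme are assumptions made for this example; the shift is a toy transformation and does not represent any particular encryption scheme.

```python
def to_ascii_codes(word: str) -> list[int]:
    """Return the decimal ASCII code of each character in `word`."""
    return [ord(ch) for ch in word]


def shift_codes(codes: list[int], key: int) -> list[int]:
    """Toy concealment (illustrative assumption): add `key` to each code, modulo 128."""
    return [(c + key) % 128 for c in codes]


codes = to_ascii_codes("the")
print(codes)                  # the three decimal symbols for "the"
print(shift_codes(codes, 3))  # the same symbols after the toy shift
```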

For amino acids, there are very large databases of knowledge that consist of 
sequences of proteins. Similar proteins are usually grouped into "families." Family 
members should have the same properties associated with them; once the properties of
one of the family members are known, it is assumed that the other family members will
have similar properties. Additionally, once the family is known, the family may be used
to determine which candidate proteins are members of the family. Therefore, there has 
been tremendous research to determine how to best group proteins into families. 

Generally, there are four different methods used to group proteins. One method 
is to determine a pattern of symbols that all of the sequences share. This is called a single
descriptor approach, which looks for particular patterns of characters. The patterns are 
series of expected amino acids, described by alphabetic characters. In the pattern, some 
locations could be important and some locations might not be. An example pattern for a 
single descriptor might require certain amino acids to be in one particular location, then 
allow several "don't care" locations where any amino acid could reside, and then require 
only a particular amino acid in a final location. The patterns are based on observations 
that, in nature, specific amino acid positions seem to be preserved in a biased way. These 
specific amino acid positions are "conserved" even though their neighbors can undergo
mutations. Thus, researchers used the concept of conservation to describe the members 
of the family. A very large, well known database of the single descriptor type is the 
Prosite database. There are about 1100 families in this database. To find the patterns 
contained in each family, the proteins contained there were first aligned. Then, the most 
conserved region of the family was located and the pattern (the single descriptor) 
contained in all or most of the family members was determined. However, there could be 
members of a family that did not share the single descriptor. This generates false
negatives: true members of the family are not recognized as such.
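As a minimal sketch of the single descriptor idea, the following Python fragment tests sequences against one pattern that has a required residue, two "don't care" locations, a choice of residues, and a final required residue. The pattern and the sequences are hypothetical and are not taken from Prosite; an ordinary regular expression stands in for the descriptor.

```python
import re

# Hypothetical single descriptor: C, any two residues, K or R, then H.
PATTERN = r"C..[KR]H"


def matches_single_descriptor(pattern: str, protein: str) -> bool:
    """True if the protein contains at least one instance of the pattern."""
    return re.search(pattern, protein) is not None


# Hypothetical family members; the third lacks the conserved H.
family = ["MACDEKHLL", "GGCWWRHAA", "CAAKGH"]
for seq in family:
    print(seq, matches_single_descriptor(PATTERN, seq))
```

Under these assumptions the third sequence is rejected even though it was supplied as a family member, which is the false negative situation described above.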

An improvement on the single descriptor method is the composite descriptor 
method. The composite descriptor method examines a candidate protein for several 
alphabetic patterns, as opposed to only one pattern with the single descriptor method. 
Again, this method generally requires aligning the proteins so that the multiple patterns, 
i.e., the composite descriptor, properly align within their respective blocks. 

The conceptual underpinnings are the same across all the methods that rely on 
composite descriptors. Any differences have essentially to do with either the manner in 
which multiple alignments are used to construct the descriptors or whether the descriptors 
are explicitly (e.g., a "regular expression") or implicitly (e.g., a "profile") represented in 
the composite description. Additional characteristics common to these approaches 
include: (a) an iterative component; (b) the availability of a set of known (or alleged)
family members (= "training" set) that provides an initial "bootstrapping" stage; (c) the 
computation of a multiple-sequence alignment involving members of the training set - 
these alignments are typically verified manually or semi-automatically and can be used to 
derive profiles that allow the generation of quality measures when evaluating the results; 
(d) a range of quality control checks that are optionally applied on the generated results; 
and, (e) the need to study the collection under consideration in order to identify a 
minimum set of components that will form the composite description. 

There are several problems with these approaches. For instance, in step (c), it is 
implicitly assumed that there is a multiple-sequence alignment involving all of the 
members of the training set; the alignment may either be a global alignment of both 
conserved and non-conserved regions, or a local alignment of the most conserved regions. 
This requirement unnecessarily burdens these methods. Additionally, multiple alignment 
programs usually work best when the parameters are optimized for the set of sequences 
which are being considered. 

Steps (d) and (e) presuppose the availability of biological information pertaining 
to the set under consideration, and this biological information may not always be present. 
As a matter of fact, step (e) results in the selection and use of features which are 
conditional on each other. Although easy to describe, an additional assumption here is 
that the identity, cardinality, and properties of these features are available and also agreed 
upon ahead of time. For example, a statement such as "G protein-coupled receptors 
(GPCRs) are proteins involved in signal transduction in eukaryotic organisms that consist 
of seven transmembrane helices composed typically of hydrophobic amino acids" 
represents a body of knowledge that has been used by researchers in the building of 
composite descriptors for GPCRs. With the supervised approaches described above, a 
detailed and frequently manual study of the collection under consideration is unavoidable. 



In addition to descriptor approaches, there are also "windowing" approaches that
build descriptors for a family. In these methods, one or more windows are used instead of
character patterns. A single window method is called the PROFILE approach. All of the
sequences of each of the family members are aligned with respect to their best-conserved
region. Researchers then determined a probability distribution for locations in each
column of the implied window. For each such block, they determined a probability of
expecting an amino acid at some location within the window and thus built a 'profile' of
expected probabilities for each of the columns of the window. The researchers would
slide this set of probabilities against an unknown protein. If this candidate protein
matched the expected probabilities, they included the protein as a member of the family.
This approach was more tolerant than the single descriptor approach. Subsequently,
researchers began to use profiles for multiple windows. There could be two, three, or four
windows where the members of the family could agree on content. Sometimes, a profile
was not built explicitly but rather was maintained as a collection of the instances across
the known or alleged family members of the conserved region under consideration.
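The single-window profile idea can be sketched as follows. This is an illustrative reconstruction only: the pseudocount, the log-probability score, and all function names are assumptions, and the aligned window instances are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(instances, pseudocount=1.0):
    """Per-column residue probabilities from equal-length aligned window instances."""
    width = len(instances[0])
    total = len(instances) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in instances)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def best_window_score(profile, candidate):
    """Slide the profile along the candidate; return the best log-probability score."""
    width = len(profile)
    best = -math.inf
    for start in range(len(candidate) - width + 1):
        score = sum(math.log(profile[i][candidate[start + i]]) for i in range(width))
        best = max(best, score)
    return best


profile = build_profile(["ACDE", "ACDE", "ACDF"])
print(best_window_score(profile, "GGACDEGG"))
print(best_window_score(profile, "GGWWWWGG"))
```

A candidate would be accepted when its best score exceeds some cutoff; the choice of cutoff is not specified here.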

The windowing methods again rely on alignment of proteins, which can be 
relatively complex and computationally lengthy. Typically, these windowing methods are 
supervised and biological information pertaining to the family can facilitate the analysis. 
With supervised approaches, a detailed and frequently manual study of the collection 
under consideration is unavoidable. 

Therefore, there exists a need to provide a way of determining and using family 
members of sequences in an unsupervised manner, without knowledge of biological 
information related to the family, and without aligning the sequences. 

Summary of the Invention 

Generally, the present invention provides a way of determining in an 
unsupervised manner additional members for a family that is defined initially through 
exemplar sequences. The present invention is unsupervised in that it proceeds without 
any information related to the exemplar sequences defining the family, without aligning
the exemplar sequences, without prior knowledge of any patterns in the exemplar 
sequences, and without knowledge of the cardinality or characteristics of any features that 
may be present in the exemplar sequences. The cardinality of a set is the number of items 
in a set. For instance, the cardinality of the set of letters in the English alphabet is 26. In 
one aspect of the invention, a method is used to take a set of unaligned sequences and 
discover several or many patterns common to some or all of the sequences. These 
patterns can then be used to determine if candidate sequences are members of the family. 
In another aspect of the invention, a method is used to take a set of sequences and to 
determine a set of maximal patterns common to a number of sequences. The maximal 
patterns are determined without any previous knowledge about any properties or features 
that may be present in the processed sequences. 

A more complete understanding of the present invention, as well as further 
features and advantages of the present invention, will be obtained by reference to the 
following detailed description and drawings. 

Brief Description of the Drawings 

FIG. 1 is a schematic block diagram showing an architecture of a system 
for unsupervised building and exploitation of composite descriptors in accordance with 
an embodiment of the present invention; 

FIG. 2 is a flow chart describing unsupervised building and exploitation of
composite descriptors employed by the system of FIG. 1;

FIG. 3 is a histogram of the scores for the sequences of RAND-SP when 
processed by the composite descriptor for an 80-sequence G protein-coupled receptor 
training set; and 

FIG. 4 is a histogram of the scores for the sequences of RAND-SP when 
processed by the composite descriptor for a 70-sequence helix-turn-helix training set. 

Detailed Description of Preferred Embodiments

Generally, the present invention provides a way of determining in an
unsupervised manner additional members for a family that is defined initially through
exemplar sequences. The present invention is unsupervised in that it proceeds without 
any information related to the exemplar sequences defining the family, without aligning 
the sequences, without prior knowledge of any patterns in the exemplar sequences, and 
without knowledge of the cardinality or characteristics of any features that may be present
in the exemplar sequences. The cardinality of a set is the number of items in a set. For 
instance, the cardinality of the set of letters in the English alphabet is 26. In one aspect 
of the invention, a method is used to take a set of unaligned sequences and discover 
several or many patterns common to some or all of the sequences. These patterns can 
then be used to determine if candidate sequences are members of the family. In another
aspect of the invention, a method is used to take a set of sequences and to determine a set 
of maximal patterns common to a number of sequences. The maximal patterns are 
determined without any previous knowledge about any properties or features that may be 
present in the processed sequences. 

As previously stated, the present invention provides a way of determining
family members in an unsupervised manner. By "unsupervised" it is meant that no
predetermined or a priori information is needed/known about the exemplar sequences or 
is employed by the discovery process. Additionally, there is no need for user supervision 
or intervention. For instance, the present invention does not require knowledge of 
biological information related to the family, aligned sequences, knowledge of properties
of the exemplary sequences defining the family, and/or knowledge of the cardinality or 
characteristics of features of the exemplar sequences. It is possible to exclude one or 
more of these restrictions. For instance, the present invention could be used on a set
of aligned sequences. The present invention would still determine a composite descriptor 
suitable for examining candidate sequences and either including these sequences in or 
excluding them from the family. However, a great benefit of the present invention is that 
it does not need aligned sequences or the knowledge of predetermined properties and
features that may be present in the exemplar sequences. Aligning sequences and 
determining properties and features in the exemplar sequences that originally define the 
family is time consuming, complex, and at times intractable. Instead, the present 
invention can determine a composite descriptor without such time intensive efforts. 

Concerning features and properties of a sequence of symbols, it is not easy
to define what a feature is. The definition of a feature is directly related to the
representation of the items that are studied, i.e., the way each of the objects processed by 
the system under consideration is represented and stored in a computer. Such a 
representation is in turn related to the way an object can appear in the context of the
sensor data, and is unavoidably application specific. For example, in the context of image
processing by a computer, the following image characteristics have been used as features: 
linear and curvilinear segments, curvature extrema, curvature discontinuities, and 
identifiable conics. In the context of computational biology, an example of a feature can
be a combination of amino acids with understood behavior and possibly known
3-dimensional structure. For instance, for a helix-turn-helix (HTH) motif that mediates
the binding of many regulatory proteins to regulatory control sites of DNA, the two 
features are the two helices at the beginning (7 a.a.) and the end (9 a.a.) of the 20 a.a. 
stretch that corresponds to an instance of the HTH motif. A property can be thought of as 
an attribute of a feature: in the case of the HTH, a property would be the fact that the two
features (helices) are held together through non-polar interactions of their side chains. It
should be stressed that the concept of the feature is also intrinsically connected to the task 
at hand. For example, for some applications, individual a.a. letters can be thought of as
"features."

What is important is that previously researchers had to (a) know 
something about the set of sequences, or (b) align the exemplar sequences, or (c) perform 
both (a) and (b) before they could determine those motifs that were peculiar to the 
exemplar sequences and, thus, by extension specific to and characteristic of the family
defined by the exemplar sequences. The researchers knew and exploited properties of 
sequences, knew and exploited features of the sequences, and/or aligned the sequences. 
The present invention is unsupervised, meaning that no information about the exemplar 
sequences need be known, and the present invention will still determine patterns that can
subsequently be used to define the family implied by the exemplar sequences as well as
analyze candidate sequences for inclusion into this family. 

In an embodiment of the present invention, a training set of family 
members is searched in an unsupervised manner to determine statistically significant, 
common patterns between some or all of the family members. Each family member
comprises a sequence, which itself comprises a series of characters. The present
invention may be used on any sequence of symbols that can be described as a linear 
stream of events, e.g., DNA (deoxyribonucleic acid), proteins, languages, and numbers. 
Preferably, a predetermined sequence-support threshold will initially be set. This 
predetermined sequence-support threshold determines how many of the sequences in the
family need to have a pattern for the pattern to be considered common to the training set.
For instance, if there are 100 sequences in the family, the predetermined 
sequence-support threshold could be set to 50. This means that a pattern must be found 
in 50 of the sequences for the pattern to be considered common to the family members in 
the training set. Generally, this threshold is initially set to the number of sequences in the
training set. Should no common patterns be found, the sequence threshold may be
modified.
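The role of the sequence-support threshold can be illustrated with a short sketch. The function names are assumptions introduced for this example, and an ordinary regular expression search stands in for pattern matching.

```python
import re


def sequence_support(pattern: str, training_set: list[str]) -> int:
    """Number of training sequences that contain at least one instance of the pattern."""
    return sum(1 for seq in training_set if re.search(pattern, seq))


def is_common(pattern: str, training_set: list[str], threshold: int) -> bool:
    """A pattern is 'common' when its support meets the sequence-support threshold."""
    return sequence_support(pattern, training_set) >= threshold


# Hypothetical four-sequence training set.
training = ["AABCA", "XABCX", "QQQQQ", "ABCABC"]
print(sequence_support("ABC", training))
print(is_common("ABC", training, threshold=3))
```

With 100 sequences and a threshold of 50, as in the example above, `is_common` would require the pattern to occur in at least 50 of the sequences.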

If common patterns are found, they are examined to determine if they are 
statistically significant. Any remaining statistically significant patterns may be used to
describe the family members and, subsequently, to ascertain if candidate members are 
part of the family. Preferably, the statistically significant and common patterns become 
part of a composite descriptor. Once the statistically significant and common patterns are 
found for a set (which could include all) of the family members, the sequences containing 
the patterns are removed from the training set. This results in a smaller training set. 

This modified training set is again searched for common patterns. The 
sequence threshold may be modified to search for fewer sequences of the modified 
training set or to search for all of the sequences in the training set. If any common and 
statistically significant patterns are found, the composite descriptor is modified to add the 
new patterns. This process preferably continues until either all sequences are removed 
from the training set or until common patterns cannot be found between the remaining 
sequences. 
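The iterative procedure described above can be sketched as a loop. The sketch makes several simplifying assumptions that are not part of the described method: pattern discovery is replaced by a toy search for fixed-length substrings, the statistical significance check is omitted, and the threshold rule (a fraction 1/b of the remaining set, with a floor of 2) is chosen only for illustration.

```python
def discover_patterns(seqs, k, length=3):
    """Toy stand-in for pattern discovery: substrings of a fixed length
    that occur in at least k sequences, mapped to the indices containing them."""
    support = {}
    for idx, s in enumerate(seqs):
        for i in range(len(s) - length + 1):
            support.setdefault(s[i:i + length], set()).add(idx)
    return {p: ids for p, ids in support.items() if len(ids) >= k}


def build_composite_descriptor(training_set, b=4, min_k=2, length=3):
    """Repeatedly discover common patterns and remove the sequences they cover."""
    remaining = list(training_set)
    descriptor = []
    while remaining:
        k = max(min_k, len(remaining) // b)        # assumed threshold rule
        found = discover_patterns(remaining, k, length)
        if not found:                              # no common patterns remain
            break
        descriptor.extend(sorted(found))           # add new patterns to the descriptor
        covered = set().union(*found.values())
        remaining = [s for i, s in enumerate(remaining) if i not in covered]
    return descriptor, remaining


training = ["AAABCXX", "YYABCZZ", "QQABCQQ", "WWDEFWW", "RRDEFRR", "MMMMMNP"]
print(build_composite_descriptor(training))
```

The second returned value lists sequences that share no pattern with the rest of the set, which in this toy setting play the role of outliers.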

Once the composite descriptor is determined, the composite descriptor 
may be used to determine if a candidate sequence is part of the family. In particular, the 
composite descriptor may be used to search a database of sequences to determine if 
individual sequences in the database are members of the family described by the 
composite descriptor. Usually, a pattern-support threshold will be used to make this 
determination. The pattern-support threshold determines the number of patterns that 
must match between the candidate sequence and the patterns in the composite descriptor. 
For example, if there are 1000 patterns in the composite descriptor, the pattern-support 
threshold may require matches on 995 of the patterns for the candidate sequence to be 
considered a member of the family. Moreover, after more members of the family are 
found by using the current composite descriptor, these new members may be added to the 
original training set to create a new training set. The composite descriptor method may 
again be run on the new training set. This will provide even greater sensitivity and allow 
the composite descriptor to "learn" new patterns common to the family. 
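A minimal sketch of applying a composite descriptor with a pattern-support threshold follows. The descriptor, the candidate sequence, and the function names are hypothetical, and regular expression search is assumed as the matching mechanism.

```python
import re


def count_matching_patterns(descriptor: list[str], candidate: str) -> int:
    """Number of descriptor patterns with at least one instance in the candidate."""
    return sum(1 for p in descriptor if re.search(p, candidate))


def is_family_member(descriptor: list[str], candidate: str, pattern_support: int) -> bool:
    """Accept the candidate when enough descriptor patterns match it."""
    return count_matching_patterns(descriptor, candidate) >= pattern_support


descriptor = ["ABC", "D.F", "[KR]H"]   # hypothetical three-pattern descriptor
candidate = "XXABCXXDEFXX"
print(count_matching_patterns(descriptor, candidate))
print(is_family_member(descriptor, candidate, pattern_support=2))
```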

While the present invention can determine statistically significant and
common patterns with aligned sequences, the present invention does not need aligned 
sequences. To align two sequences, one or more patterns common to both sequences are 
aligned in a left-to-right order. For example, assume that the pattern being aligned is 
ABC. The sequence of characters {DEFXYZABC} would be aligned with {ABCDEF}
by either aligning the ABC patterns in a left-to-right manner or by aligning the DEF 
patterns. Thus, when aligning the ABC patterns, the XYZ of the first sequence would not
be aligned with characters in the second sequence and the DEF of the second sequence would
not align with characters in the first sequence. For this example, there is no unique
alignment and it is easy to see how the situation can be complicated further as the number
of sequences to process increases. Because the present invention preferably searches for 
patterns common to the sequences, the present invention would determine that ABC was 
common to the two sequences, regardless of their alignment. 
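The alignment-free point can be checked directly on the two example sequences. The sketch below simply intersects the sets of fixed-length substrings of each sequence; this is an illustrative simplification and not the discovery algorithm itself.

```python
def common_substrings(a: str, b: str, length: int) -> set[str]:
    """Substrings of the given length that occur in both sequences, at any offset."""
    subs_a = {a[i:i + length] for i in range(len(a) - length + 1)}
    subs_b = {b[i:i + length] for i in range(len(b) - length + 1)}
    return subs_a & subs_b


print(sorted(common_substrings("DEFXYZABC", "ABCDEF", 3)))
```

Both ABC and DEF are reported even though they appear in opposite orders in the two sequences, so no single left-to-right alignment would place both in register.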

The present invention also does not need the availability of biological
information related to the family. While such information could be used, the present
invention will determine statistically significant and common patterns within the family 
members without biological information. Moreover, because outliers are expected to not 
contribute much in the way of statistically significant patterns to the composite descriptor, 
outliers have less of an impact on the present invention. 

Turning now to FIG. 1, FIG. 1 is a schematic block diagram showing the
architecture of an illustrative system 100 in accordance with the present invention.
System 100 may be embodied as a general purpose computing system, such as the general 
purpose computing system shown in FIG. 1. System 100 includes a processor 110 and 
related memory, such as a data storage device 120, which may be distributed or local.
The processor 110 may be embodied as a single processor or a number of local or
distributed processors operating in parallel. Such processors could communicate through 
a common bus or through one or more networks. The data storage device 120 is operable 
to store one or more instructions and data, which the processor 110 is operable to retrieve,
interpret, execute and use. Data storage device 120, in this example, comprises a 
composite descriptor method 200, a composite descriptor 130, a training set 140, a 
database 150, and discovered family members 160. Not all of these need be present at 
any one time. In general, the composite descriptor method 200 will examine the training 
set 140 for common and statistically significant patterns. Training set 140 comprises a 
number of sequences, each of which comprises a series of symbols. Each symbol comes
from a collection of possible symbols referred to as an alphabet. The alphabet could 
describe such entities as DNA (deoxyribonucleic acid) or proteins. The composite 
descriptor 130 will be modified to add any common and statistically significant patterns 
that are found. Database 150 contains a number of candidate sequences. Once a 
composite descriptor 130 is created, the composite descriptor may be used to determine 
which, if any, of the candidate sequences in the database 150 are part of the family of 
sequences described by composite descriptor 130. If any candidate sequences are 
determined to belong to the family, these candidate sequences may be stored in the 
discovered family members area 160. If desired, the discovered family members 160 
may be added to the training set 140 to create a new training set 140. Composite 
descriptor method 200 may then act on this new training set 140 to further refine 
composite descriptor 130. 

As is known in the art, composite descriptor method 200 may be 
distributed as an article of manufacture that itself comprises a computer readable medium 
having computer readable code means embodied thereon. The computer readable 
program code means is operable, in conjunction with a computer system such as 
computer system 100, to carry out all or some of the steps to perform the composite 
descriptor method 200. The computer readable medium may be a recordable medium 
(e.g., floppy disks, hard drives, Compact Disks, or memory sticks), or may be a 
transmission medium (e.g., a network comprising fiber-optics, the world-wide web, 
cables, or a wireless channel using time-division multiple access, code-division multiple
access, or other radio-frequency channel). Any medium known or developed that can 
store information may be used. 

Composite descriptor method 200, as shown in FIG. 2, performs 
unsupervised building of composite descriptors and then exploits the determined 
composite descriptors to find additional family members of the families described by the 
composite descriptors. Method 200 is performed whenever it is desired that a composite 
descriptor be determined and used. It should be noted that method 200 may be broken 
into multiple sections. Preferably, the steps up to step 280 would be used to determine a 
composite descriptor from a training set, while optional step 260 would be used to apply 
the composite descriptor to one or more candidate sequences, and optional steps 270 and 
275 would be used to further refine the composite descriptor. 

Method 200 begins in step 205 when a training set is provided. It should 
be noted that the steps are not necessarily performed in the order shown. The training set, T, is
preferably N unaligned sequences Si for which there is reason to believe that the 
sequences are related. There should exist identifiable local similarities among members 
of T at the amino acid level, although it is assumed that no other information is available 
for the members of T, e.g., known or identifiable secondary structures, known or 
identifiable domains, functional information, physico-chemical properties, or physical
properties. If no identifiable local similarities exist among members of T, method 200 
will not provide a suitable composite descriptor for the family, as a composite descriptor 
does not exist for the family. 

Each sequence is a series of symbols from an alphabet. For proteins, one
can denote by Σ the alphabet of all amino acids; i.e., Σ={A, C, D, E, F, G, H, I, K, L, M,
N, P, Q, R, S, T, V, W, Y}. On this alphabet, regular expressions can be defined that can
range from very simple n-grams to more general ones containing wild cards and capturing
strings of variable length. The '.' (referred to as the "don't care character") is used to
denote a position in a sequence or pattern that can be occupied by an arbitrary residue. A
bracket is meant to denote a "one of" choice; i.e., [KR] means that the position this
bracket corresponds to can be occupied by exactly one of K or R. A bracket can have a
minimum of 2 alphabet characters but not more than |Σ| - 1.
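The pattern notation just described happens to coincide with a subset of ordinary regular expression syntax, which permits a compact sketch. The validity check below encodes only the rules stated above (alphabet symbols, the don't care character, and brackets holding between 2 and |Σ| - 1 distinct symbols); the function names are assumptions made for illustration.

```python
import re

SIGMA = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino acid alphabet


def tokenize(pattern: str) -> list[str]:
    """Split a pattern into positions: a bracket, a don't care '.', or one symbol."""
    return re.findall(r"\[[^\]]*\]|.", pattern)


def is_valid_pattern(pattern: str) -> bool:
    """Check each position against the notation rules described in the text."""
    for tok in tokenize(pattern):
        if tok.startswith("["):
            choices = set(tok[1:-1])
            if not choices <= set(SIGMA):
                return False
            if not 2 <= len(choices) <= len(SIGMA) - 1:
                return False
        elif tok != "." and tok not in SIGMA:
            return False
    return True


def matches(pattern: str, sequence: str) -> bool:
    """True if some block of consecutive residues is an instance of the pattern."""
    return re.search(pattern, sequence) is not None


print(is_valid_pattern("A.[KR]C"), matches("A.[KR]C", "GGAWKCGG"))
```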

In step 210, the sequence threshold, K, is set. It is possible to set K=|T|,
which is the number of sequences in the training set. In actuality, it has proven beneficial
to assign a small starting value to K that is a fraction of the number of sequences in T.
Experiments have shown that a starting value of K=|T|/b with b=4 or 5 is a good choice
across many data sets. Note that the smaller the value of b, the higher the redundancy of
the composite descriptor will be. The selection of K also can depend on how conserved,
or similar, the family members are. If the family members are well conserved, then K can
be higher; if the family members are not well conserved, then K can be lower.
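The starting value K=|T|/b can be computed as below. Rounding down to an integer and enforcing a floor of 2 are assumptions made for this sketch; the function name is hypothetical.

```python
def starting_threshold(num_sequences: int, b: int = 4, minimum: int = 2) -> int:
    """Starting sequence threshold K = |T| / b, rounded down, never below `minimum`."""
    return max(minimum, num_sequences // b)


for n, b in [(1000, 4), (80, 5), (3, 4)]:
    print(n, b, starting_threshold(n, b))
```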

In step 215, a set of maximal patterns in the K sequences is determined. In 
general, this step tries to determine common patterns between the K sequences. Not only 
should the patterns be common, but they should also be as large as possible. These large 
patterns may further be mathematically defined as "maximal" in a way described below. 
Any of the available algorithms which can guarantee that all sought patterns are 
discovered and that they are maximal can be used here. For the experiments related 
below, a Teiresias algorithm was used. This algorithm is described in Floratos, et al., 
U.S. Patent No. 6,108,666, "Method and Apparatus for Pattern Discovery in 
1-Dimensional Systems"; Floratos, et al., U.S. Patent No. 6,092,065, "Method and 
Apparatus for Discovery, Clustering and Classification of Patterns in 1-Dimensional 
Event Streams"; Rigoutsos, I. and A. Floratos, "Combinatorial Pattern Discovery in 
Biological Sequences: the Teiresias Algorithm," Bioinformatics, 14(1):55-67, 1998; and
Rigoutsos, I. and A. Floratos, "Motif Discovery Without Alignment Or Enumeration," 
Proceedings 2nd Annual ACM International Conference on Computational Molecular 
Biology, New York, NY, March 1998, the disclosures of which are incorporated by 
reference herein.

A short introduction to this method follows. A pattern S is a regular
expression on Σ that defines a language G(S). The elements of the language are all the
strings that can be obtained from the regular expression that S stands for. A protein is
said to match a given pattern S if and only if it contains at least one substring (i.e., a
block of consecutive residues) that belongs in G(S). A pattern S' is said to be more
specific than a pattern S if G(S') ⊂ G(S). Given a pattern S and a database D, the offset list
of S may be defined with respect to D (or simply the offset list of S, when the
database D is unambiguously implied) to be the following set: L_D(S) = {(i, j) | the i-th
sequence of the database D matches the pattern S at offset j}. A pattern S is called
maximal with respect to a database D if there exists no pattern S' which is more specific
than S and such that |L_D(S)| = |L_D(S')|. A maximal pattern cannot be made more specific
without simultaneously reducing the cardinality of its offset list. A pattern S is called an 
<L,W> pattern (with L<W) if every substring of S with length W contains L or more 
non-don't care positions. Note that a given choice for the parameters L and W has a 
direct bearing on the degree of remaining similarity among the instances of the domain 
that is captured by the regular expression: the smaller the value of the ratio L / W, the 
higher the degree of sought similarity. 
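The definitions above can be exercised on a toy database. The sketch is illustrative only: offsets are counted from 0, the <L,W> test follows the stated definition literally (so a pattern shorter than W passes vacuously), and the function names are assumptions.

```python
import re


def offset_list(pattern: str, database: list[str]) -> set[tuple[int, int]]:
    """All (i, j) such that sequence i matches the pattern at offset j.
    A lookahead is used so that overlapping instances are also reported."""
    regex = re.compile(f"(?={pattern})")
    return {(i, m.start()) for i, seq in enumerate(database) for m in regex.finditer(seq)}


def is_lw_pattern(pattern: str, l: int, w: int) -> bool:
    """Every run of w consecutive positions holds at least l non-don't-care positions."""
    positions = [tok != "." for tok in re.findall(r"\[[^\]]*\]|.", pattern)]
    return all(sum(positions[i:i + w]) >= l for i in range(len(positions) - w + 1))


db = ["XABCX", "ABCABC"]
print(sorted(offset_list("A.C", db)))
print(len(offset_list("A.C", db)) == len(offset_list("ABC", db)))
print(is_lw_pattern("A..C", 2, 4), is_lw_pattern("A...C", 2, 4))
```

In this toy database the more specific pattern ABC has an offset list of the same size as A.C, so A.C is not maximal with respect to it.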

The Teiresias algorithm is a pattern discovery algorithm that can guarantee 
the discovery of all <L,W> patterns that are maximal and supported by K or more input 
sequences. The pattern discovery is carried out while allowing the symbols of Σ to be
partitioned in equivalence classes. Any symbol within a given class is able to replace any
other symbol of the (same) class. One such example would be the partition: {A, G}, C,
{D, E}, {F, Y}, H, {I, L, M, V}, {K, R}, {N, Q}, P, {S, T}, W. In fact, the various
symbol classes do not have to form a partition of Σ. In other words, a given symbol can
belong to more than one class. One such set of classes can be obtained by using a 
distance threshold with any of the currently available scoring matrices such as the PAM 
and BLOSUM series. PAM is described in Dayhoff, "Atlas of Protein Sequence and
Structure," vol. 5, National Biomedical Research Foundation, 1978; and BLOSUM is
described in Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," Proc.
Natl. Acad. Sci. USA, 89:10915-10919, 1992, the disclosures of which are
incorporated by reference.
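The example partition given above can be encoded as follows. Mapping every residue to the first member of its class is one simple way, assumed here for illustration, to make equivalent residues compare equal; it applies only when the classes form a true partition.

```python
# The example partition from the text, one string per equivalence class.
PARTITION = ["AG", "C", "DE", "FY", "H", "ILMV", "KR", "NQ", "P", "ST", "W"]
CLASS_OF = {aa: cls for cls in PARTITION for aa in cls}


def equivalent(x: str, y: str) -> bool:
    """True if the two residues belong to the same equivalence class."""
    return CLASS_OF[x] == CLASS_OF[y]


def reduce_sequence(seq: str) -> str:
    """Replace each residue by the first member of its class (assumed convention)."""
    return "".join(CLASS_OF[aa][0] for aa in seq)


print(equivalent("K", "R"), equivalent("K", "D"))
print(reduce_sequence("KLD"), reduce_sequence("RME"))
```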

The Teiresias algorithm permits the discovery of all <L,W> patterns that 
are maximal and supported by K or more input sequences, in the presence of stated 
equivalences involving symbols from the input alphabet. Each pattern S that the 
Teiresias algorithm will discover is of the form: 

(Σ ∪ [ΣΣ*Σ]) (Σ ∪ [ΣΣ*Σ] ∪ {.})* (Σ ∪ [ΣΣ*Σ]).

Associated with each pattern S is the sensitivity of the pattern, which is 
directly related to the number of sequences in D that contain S. The sensitivity is a 
measure of how many members of the training set T do not match S (= false negatives). 
Also associated with S is the pattern's specificity, which is a direct measure of how many 
members of the database D match the pattern, but are not true members of the collection 
that the training set T represents (= false positives). The choice of the values for the 
parameters L and W is a function of the collection under consideration. Experimental 
work has shown that a choice supporting a moderate degree of local similarity (e.g., 
~40-50%) is a good choice across a very large variety of test cases. 

In step 225, it is determined if any patterns are found. If no patterns are 
found (step 225 = NO), the sequence-support threshold, K, can be decreased. Preferably, 
this is done by setting K=|T|/b, where b is usually set to 4 or 5. It is also possible to set 
b to smaller values, such as 2 or 3. Setting b to smaller values increases the amount of 
processing time it might take to determine maximal patterns. For instance, if there are 
1000 sequences in T and K = |T| = 1000, and no common maximal patterns are found, it 
is likely that changing K to 999 will not find any common maximal patterns either. 
Changing K from 1000 to 250, however, will make it more likely that common 
maximal patterns may be found. After K has been changed (step 230), it is determined if 
K meets a predetermined minimum limit. This limit has been set, in the example of FIG. 
2, as 2. If there are two (or even more) sequences that have a pattern, even a maximal 
pattern, in common, this pattern may not be representative of the family members. In step 
220, other minimum sequence-support thresholds may be used, if desired. The choice of 
the predetermined minimum limit is not critical, as outliers (those sequences that are at the 
"edge" of the family or even not part of the family) are expected to have little or no 
bearing on the composite descriptor of the present invention. This is discussed in more 
detail below, in reference to step 260. 
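The way K is lowered can be illustrated with a small generator. It is a sketch only: the function name is hypothetical, and rounding K/b down to an integer is an assumption, since the text does not state how fractional values are handled.

```python
def support_schedule(n_sequences, b=4, minimum=2):
    """Yield successive values of K, starting at |T|.

    Each step divides K by b (integer division, an assumption) and
    never goes below the predetermined minimum limit.
    """
    k = n_sequences
    while True:
        yield k
        if k <= minimum:
            return
        k = max(minimum, k // b)
```

With 1000 training sequences and b = 4, list(support_schedule(1000)) visits K = 1000, 250, 62, 15, 3 and finally 2, so only six discovery passes are needed in the worst case.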

If maximal patterns are found in step 215 (step 225 = YES), in step 235, it 
is determined if the maximal patterns are statistically significant. In general, in step 235, 
it is determined, for each maximal pattern, what the probability is that the maximal 
pattern occurs in a sequence. This probability should meet a predetermined threshold. 
This step is important because the patterns will be exploited, as part of the composite 
descriptor, to determine additional family members. If relatively general patterns are 
used, the patterns could include candidate members into a family when the candidate 
members are not members of the family. For instance, for the English language, the 
pattern "the" is much more likely to appear in a sentence than is the pattern "quit." The 
pattern "the" would be much more likely to include candidate members as part of the 
family than would the pattern "quit." This would be appropriate if the family was defined 
as any sentence having the pattern "the." However, a much more likely scenario is that 
the family is defined as any sentence having the pattern "quit"; if the pattern "the" is 
then used as part of a composite descriptor, it is possible that this pattern will generate 
too many false family members. 

From the set of maximal <L,W> patterns that are discovered, the set M_s is 
selected that contains only those that are statistically significant. With appropriate 
modifications, any of several published methods can be used at this step, the disclosures 
of which are herein incorporated by reference: Atteson, "Calculating the Exact 
Probability of Language-like Patterns in Biomolecular Sequences," Proceedings of the 
Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB '98), 
Menlo Park, California, AAAI Press, 1998; Jonassen, Collins, and Higgins, "Finding 
Flexible Patterns in Unaligned Protein Sequences," Protein Science, pp. 1587-1595, 
1995; Nicodeme, Salvy and Flajolet, "Motif Statistics," INRIA Technical Report No. 
3606, January 1999; Pevzner, Borodovsky and Mironov, "Linguistics of Nucleotide 
Sequences: the Significance of Deviation from Mean Statistical Characteristics and 
Prediction of the Frequencies of Occurrences of Words," Journal of Biomolecular 
Structure and Dynamics, 6:1013-1026, 1989; Regnier, "A Unified Approach to Word 
Statistics," Proceedings 2nd Annual ACM International Conference on Computational 
Molecular Biology, New York, NY, March 1998; Sagot and Viari, "A Double Combinatorial 
Approach to Discovering Patterns in Biological Sequences," Proceedings of the Seventh 
Symposium on Combinatorial Pattern Matching, pp. 186-208, 1996; Sewell and Durbin, 
"Method for Calculation of Probability of Matching a Bounded Regular Expression in a 
Random Data String," Journal of Computational Biology, 2(1):25-31, 1995; and 
Wootton, "Evaluating the Effectiveness of Sequence Analysis Algorithms Using Measures 
of Relevant Information," Computers Chem., 21(4):191-202, 1997. 

For simplicity, the probabilities of the discovered patterns, as disclosed in 
the Examples section below, were determined with the help of a 2nd order Markov chain 
method, as described in Salzberg, Delcher, Kasif, and White, "Microbial gene 
identification using interpolated Markov models," Nucleic Acids Res., 26(2):544-8, 1998, 
which is incorporated herein by reference. The natural logarithm of the estimated 
probability was used as the measure of a pattern's significance, and a pattern was 
retained if this value did not exceed a threshold, Thr_prob. This threshold can be 
estimated as a function of the size of the database to be searched with the composite 
descriptor. 
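The significance filter can be illustrated with a deliberately simplified probability model. The sketch below assumes independent, identically distributed residues with fixed background frequencies; it is not the 2nd order Markov chain method used for the reported results, and the function names and the list-of-positions representation of a pattern are hypothetical.

```python
import math


def log_probability(pattern_positions, background):
    """Natural log of the chance that a random segment matches a pattern.

    pattern_positions: list of strings, one per position; "." is a
    don't care, any other string lists the residues allowed there.
    background: dict mapping residue -> assumed i.i.d. frequency.
    """
    logp = 0.0
    for allowed in pattern_positions:
        if allowed == ".":
            continue  # a don't care matches with probability 1
        logp += math.log(sum(background[a] for a in allowed))
    return logp


def select_significant(patterns, background, thr_prob=-30.0):
    """Keep patterns whose log-probability is thr_prob or less."""
    return [p for p in patterns
            if log_probability(p, background) <= thr_prob]
```

Under a uniform background of 0.05 per residue, each single-residue literal contributes about -3.0 to the log-probability, so roughly eleven such literals are needed before a pattern passes a threshold of -30.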

The cardinality of the sub-selected set M_s of patterns ought to be high 
because of the redundancy of sequence segments from T that are captured by the patterns. 
This will guarantee a strong signal-to-noise ratio when the composite descriptor is used as 
a predicate. It is worth pointing out that even if the training set has just a few 
members, the cardinality of M_s (and thus the redundancy) can be high, since there is a 
multitude of patterns that one can generate even from a few sequences. 

Once the statistically significant patterns are found, the sequences that 
match these patterns are removed from the training set, T, of sequences. This occurs in 
step 240. It should be noted that steps 240, 245 and 250 do not have to occur in this order 
and could even occur in parallel. Preferably, each sequence of the training set is examined 
to determine whether it matches any of the significant patterns of M_s. After all patterns 
of M_s have been exhausted, all sequences that matched one or more patterns are added to 
a temporary set A. Upon completion of the iteration, one or more sequences from T will 
have been entered into the set A; these are essentially the sequences that have been 
accounted for. What remains of T after the removal of these sequences, i.e., T \ A, is used 
as the training set for the next iteration. Thus, the training set T is modified (step 245), 
which could include marking which sequences in an array of sequences are no longer valid, 
or copying the remaining sequences into a new array. 

In step 250, the composite descriptor is modified. Preferably, the 
composite descriptor is a union of the composite descriptor and the set M_s. The set of 
significant patterns M_s which was discovered during this last iteration is added to the 
composite descriptor by adding those patterns in M_s that are currently not in the 
composite descriptor. 

In step 255, it is determined if the training set, T, is empty. If the training 
set is not empty (step 255 = NO), the method continues in step 215 and repeats. If the 
training set is empty (step 255 = YES), and after step 220 = YES, the method ends in step 
280. Optionally, steps 260, 270 and 275 may be performed at this point. 

At the end of this stage, the composite descriptor contains a set of patterns 
that by design are specific and sensitive for the collection that the training set T 
represents. Several properties distinguish this composite descriptor from previous 
collections of patterns, such as the PRINTS database of patterns. For example, the building 
of the composite descriptor is automatic; it does not require manual intervention and does 
not necessitate the computation of multiple alignments. Additionally, there is no need for 
biological knowledge specific to the training set T that will impose helpful constraints 
during generation of the composite descriptor. Also, highly similar sequences need not 
be removed from the training set prior to the building of the composite descriptor. 
Additionally, as discussed below in reference to step 260, the training set can safely 
contain a small percentage of potential outliers, i.e., sequences that have questionable 
membership in the collection that the training set represents. Because of the redundant, 
iterative nature of the building phase, the resulting composite descriptor is not expected to 
contain any statistically significant patterns that are shared by both the outliers and the 
rest of the sequences in T. Through the initial selection of the support value (small K), 
the composite descriptor can be made sensitive and contain patterns that are specific for 
the set T (i.e., large probability threshold, Thr_prob). Finally, the fact that the composite 
descriptor contains all those patterns which are specific, significant, and which by design 
account for every member of the training set, guarantees a strong signal-to-noise ratio 
when using the composite descriptor as a multi-valued predicate (which takes place in step 
260). Steps 205 through 255 may be expressed in pseudo-code as follows. 

i) CompDescr <- ∅ 

ii) K <- |T| (or K <- max(2, |T|/b); see also text) 

iii) discover the set M of all <L,W> maximal patterns in T 

iv) if ( M = ∅ ) 
        if ( K = 2 ) terminate; 
        set K <- K-1 (or K <- max(2, K/b)) 
        continue with step iii) 
    end-if 
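The building loop of steps 205 through 255 can also be sketched in Python. This is an illustrative outline under stated assumptions, not the claimed method itself: the callables discover, is_significant and matches are hypothetical stand-ins for the Teiresias discovery step, the significance test and pattern matching, K is lowered by integer division, and an iteration that yields no significant pattern accounting for any sequence is treated the same way as an iteration that finds no patterns.

```python
def build_composite_descriptor(training, discover, is_significant,
                               matches, b=4):
    """Iteratively collect significant patterns until T is accounted for.

    discover(seqs, k), is_significant(p) and matches(p, seq) are
    hypothetical callables supplied by the caller.
    """
    comp_descr = set()              # i)  CompDescr <- empty set
    remaining = list(training)      # current training set T
    k = len(remaining)              # ii) K <- |T|
    while remaining:
        found = discover(remaining, k)                     # iii)
        m_s = {p for p in found if is_significant(p)}      # step 235
        accounted = [s for s in remaining
                     if any(matches(p, s) for p in m_s)]   # set A
        if not accounted:
            # Nothing usable at this support level: lower K or stop.
            if k <= 2:
                break
            k = max(2, k // b)
            continue
        comp_descr |= m_s                                  # step 250
        # Next training set is T \ A (step 245).
        remaining = [s for s in remaining if s not in accounted]
    return comp_descr
```

Passing the three helpers in as parameters keeps the sketch independent of any particular discovery engine; a toy discover function that reports substrings shared by at least k sequences is enough to exercise the loop.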

In step 260, the composite descriptor is exploited to determine if candidate 
sequences are members of the family described by the composite descriptor. Generally, a 
database of sequences (such as database 150 of FIG. 1) will be searched, but individual 
sequences may also be compared against the composite descriptor. The composite 
descriptor will be a number of common, statistically significant and maximal patterns that 
describe the family. As such, the composite descriptor acts much like a dictionary to 
describe the family. It can be used in step 260 to determine additional members of the 
family. 

Because method 200 relies on searching through a family and determining 
the common, statistically significant and maximal patterns that compose the composite 
descriptor, outliers tend not to matter as much for the present invention. An outlier is a 
sequence that has been erroneously included within the family. Some simple examples 
will help to explain why outliers are not a hindrance to the present invention. 

Assume that there are 100 members of the family; assume also that 93 
members of the family are accounted for but there are 7 outliers that were erroneously 
included as members of the family. Since, by definition, the latter set comprises the 
outliers, it is generally true that the number of patterns that will be shared among them 
and the remaining 93 sequences should be very small (if not 0) when compared to the 
number of patterns that will be shared by the 93 truly related sequences. This will thus 
generate very small (if any) support for sequences that are not true members of the family 
being studied. Moreover, these erroneous patterns will be further filtered out through the 
statistical significance filtering stage. Finally, when the composite descriptor, which 
contains patterns common to all 100 sequences, is used to determine if a new sequence is 
part of the family, the composite descriptor will be used with a pattern-support threshold. 
In other words, there will be some minimum number of patterns that the new sequence 
must have in order to be considered part of the family. This threshold will usually be 
high enough such that outliers, even if they contribute patterns, will not cause non-family 
members to be included within the family. 

In step 260, the composite descriptor can be used as a multi-valued 
predicate that can determine the membership of a query sequence in the collection that the 
original training set T defines. The composite descriptor can be used to examine a 
candidate-for-membership-in-T sequence S_cand for instances of the permitted patterns. 
Given S_cand, as many local counters as the length of the sequence may be allocated and 
initialized to 0. A global counter for the sequence may also be allocated and also 
initialized to 0. If it is determined that a segment of the query matches a pattern m, the 
local counters at the sequence positions matching the pattern are incremented by an 
amount equal to d. The possible choices for d include, among others, "the number of 
occurrences o_m of m in T" and "the number 1." The former choice favors segments that 
match patterns supported by a lot of sequences in T, whereas the latter gives 
comparatively increased support to segments that are only moderately conserved. The 
choice for the amount d by which to increment the local counters modifies the semantics 
of the predicate's output value. 

If the value of d is set to '1' then the predicate is a measure of how many 
distinct patterns generated from T are matched by the query sequence. In this case, large 
values indicate that the result is corroborated by multiple patterns which are specific for 
the collection T. Smaller values are at the very minimum indicative of the existence of 
local similarities that are shared by the query and one or more members of the training set 
T. Such similarities can imply one of two things: either the query is a true but distant 
member of the collection under consideration, or it is not a true member but nonetheless 
shares one or more regions of similarity with members of the collection. 

If the value of d is set to 'the number of occurrences' of the respective 
pattern in the training set T, the predicate is a measure of how many distinct sequence 
fragments in T are similar to the respective query fragment. Large values indicate regions 
that are shared by a large number of sequences in T and can be indicative of a conserved 
active site, for example. Both choices of d have merit and the one to use depends on the 
task at hand. 

Independent of what the choice for d is, every time a segment of S_cand 
matches a pattern m, the global counter associated with S_cand is incremented by d. After 
all of S_cand has been examined, the value of the global counter is inspected; if 
it exceeds Thres_rand, S_cand is reported as a candidate for membership in the collection 
defined by T. 
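The counter bookkeeping of step 260 can be sketched as follows. The sketch makes several assumptions that are not in the text: the composite descriptor is represented as a dictionary mapping regular-expression strings to their increment d, every position spanned by a matching segment (including don't-care positions) has its local counter incremented, and overlapping instances of a pattern are all counted. The function name is hypothetical.

```python
import re


def score_candidate(sequence, descriptor, thres_rand):
    """Score one candidate sequence against a composite descriptor.

    descriptor: dict mapping a regex string -> increment d (for
    example 1, or the pattern's number of occurrences in T).
    Returns (local counters, global counter, is_candidate).
    """
    local = [0] * len(sequence)     # one counter per sequence position
    global_counter = 0
    for regex, d in descriptor.items():
        # A lookahead lets overlapping instances all be reported.
        for m in re.finditer(f"(?=({regex}))", sequence):
            for i in range(m.start(1), m.end(1)):
                local[i] += d
            global_counter += d
    return local, global_counter, global_counter > thres_rand
```

With d = 1 for every pattern, the global counter is simply the number of pattern instances found in the query, which is the setting used in the Examples section.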

The value of Thres_rand depends on the actual contents of the composite descriptor and 
can be determined as follows: beginning with the composite descriptor that was built 
from the training set T, one can scan, as outlined above, a randomized version of a very 
large database such as GenPept or Swiss-Prot. Essentially, each sequence of such a 
database is treated as a potential query. Upon completion of the scanning process, one 
can accumulate support for all the sequences that matched one or more patterns of the 
composite descriptor and histogram the support values to obtain their distribution. The 
value of Thres_rand may be determined by identifying the q-th percentile of this last 
distribution. Typically, q is set to 95 or higher. 
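A possible way to estimate Thres_rand along these lines is sketched below. The function names are hypothetical, the scoring function is passed in by the caller, each sequence is randomized by shuffling its own residues (which preserves its length and composition), and the nearest-rank definition of a percentile is an assumption, since the text does not say which definition is used.

```python
import math
import random


def shuffle_sequence(seq, rng):
    """Return a random permutation of the residues of one sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def estimate_thres_rand(database, score, q=95, seed=0):
    """q-th percentile of the support received by randomized sequences.

    score(seq) is a caller-supplied function returning the global
    counter of a sequence. Only non-zero support values enter the
    distribution; nearest-rank percentile is an assumption.
    """
    rng = random.Random(seed)
    scores = [score(shuffle_sequence(s, rng)) for s in database]
    supported = sorted(v for v in scores if v > 0)
    if not supported:
        return 0
    rank = max(1, math.ceil(q / 100 * len(supported)))
    return supported[rank - 1]
```

Fixing the seed makes the randomized database reproducible, which is convenient when thresholds obtained for different composite descriptors are to be compared.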

After step 260 has been performed, it is possible to take the new members 
found and add them to a new training set that comprises the old training set and the new 
members. Then steps 205 through 280 may be run again (step 275) to further refine the 
composite descriptor for this family. Thus, the present invention allows learning to be 
performed, if this is desired. 

The present method does not suffer from drawbacks related to (a) the need 
for good multiple sequence alignments, (b) the inclusion of outliers, (c) the inherent 
dependence of the results on the selection of the scoring matrix that is used, and (d) 
overtraining. Indeed, building of the composite descriptor does not require the 
computation of any multiple sequence alignments, whereas the redundancy of 
representation that is inherent in the composite descriptor is expected to more than 
counterbalance the inclusion of any small number of outliers. Additionally, this will 
prevent the system from including even more outliers during the following iteration. 
Moreover, after each iteration, only the sequence fragments whose support exceeds 
the threshold are considered, thus allowing the process to remain 'focused' on what has 
been deemed important and relevant for the dataset under consideration. 

Finally, it should be noted that the training set T, which is given at the 
very beginning of this iterative process, impacts the quality of the results (i.e., 
sensitivity and specificity) that the method will produce. For example, if the original 
training set is not sufficiently representative of all instances of a family's members (e.g., 
GPCRs), or of the construct of interest (e.g., the helix-turn-helix DNA binding motif), the 
generated composite descriptor should not be expected to discover all instances relating 
to the training set. This last observation holds true for all methods that try to build single 
or composite descriptors by starting with a training set T. Since the augmented training 
sets at the beginning of the (i+1)-st iteration preferably only comprise the sequence 
fragments which exceeded the threshold during the i-th iteration, the composite descriptor 
will maintain its 'focus' on what is essentially dictated by the original training set. That 
is not to say that the composite descriptor will not be sensitive; on the contrary, the 
composite descriptor will be sensitive to the extent that the processed data permit while at 
the same time remaining in lock-step, so to speak, with the originally provided training 
input. As a matter of fact, the experimental results discussed below on three specific 
datasets demonstrate that even starting with small training sets allows discovery of a large 
number of representatives of the same group. 



EXAMPLES 

Now that the method and apparatus have been described, some exemplary 
results are shown in this section. In this section, results are described from the building 
and use of composite descriptors for three distinct collections of data. The collections 
were chosen so as to showcase the ability of the present invention to handle 
input sets across a variety of contexts. 

The first collection comprises sequences from PROSITE entry PS50040 of 
elongation factor 1 gamma chain sequences; in Release 15.0 of the PROSITE database, 
only a matrix profile is available for this collection. 

The second collection comprises complete sequences as well as fragments 
of G protein-coupled receptors, a very important and diverse family of proteins that has 
traditionally been used as a benchmark test for gauging the quality of pattern-based 
approaches. 

Finally, the third collection comprises sequence fragments that are known 
to contain an instance of the helix-turn-helix DNA binding motif, a structural motif of 
great importance. 

First, the composite descriptors were built for each of the three collections 
and evaluated by treating the sequences in Swiss-Prot Release 38.0 as candidates for 
membership in each of the respective three collections. 

Once the behavior of the descriptors had been characterized in the context of 
Swiss-Prot, the 19,099 ORFs of the complete genome of Caenorhabditis elegans were 
searched; these results are reported below. 

Before proceeding, here are some methodological details and parameter 
choices that are common to all three cases. In particular, the value d, by which the 
counters are incremented, is set to 1, essentially favoring those sequences that contain 
more instances of distinct patterns over others. The value of Thr_prob is determined by 
assuming that the patterns ought to be able to discriminate among sequences in a database 
as large as GenPept. For a database of this size, an estimated log-probability of 
-25 or less ought to suffice; nevertheless, the more stringent threshold of Thr_prob = -30 
was used, with the understanding that this will result in a sacrifice in sensitivity. But as 
the results will demonstrate, even with this stringent threshold, the redundancy of each 
composite descriptor leads to a sensitivity that is satisfactory. Also, in all three cases the 
following a.a. equivalences are assumed: {A, G}, C, {D, E}, {F, Y}, H, {I, L, M, V}, 
{K, R}, {N, Q}, P, {S, T}, W. 

The First Example: EF1G / PS50040 

An application of the above-described methodology is in the context of the 
PROSITE database. Although numerous entries in PROSITE contain succinct and 
specific patterns capturing most or all of the members of the corresponding collection, 
there exist entries for which only a profile/matrix is available. PS50040, the family of 
elongation factor 1 gamma chain proteins, is one such example, which makes it an ideal 
candidate for processing with the described method. 

PS50040 comprises 10 full sequences (EF1G_ARTSA, EF1G_CAEEL, 
EF1G_HUMAN, EF1G_RABIT, EF1G_SCHPO, EF1G_TRYCR, EF1G_XENLA, 
EF1G_YEAST, EF1H_XENLA, EF1H_YEAST) and 1 fragment (EF1G_PIG). The 
reported profile matrix captures all 10 full sequences, misses the one fragment and 
generates no false positives when the target database is Swiss-Prot Rel. 38.0. 

It should be noted here that if one relaxes the constraints imposed by the 
chemical equivalence classes shown above, it is possible to discover a specific pattern 
that belongs to all 11 members of PS50040 and generates no false positives when used in 
conjunction with Swiss-Prot Rel. 38.0. In fact, this pattern is 

[ILMV]..[NW][ILMV]..[AG]...[RI][ILMV]....[KT]..F....[ILMV].[GH][AG] 

and can be used to describe and capture elongation factor 1 gamma chain proteins; the 
deviations from the above chemical equivalence classes are shown in boldface. 

The composite descriptor was built for this collection by setting the 
Teiresias parameters to L=5 and W=10; since the dataset is small there was only a single 
iteration over the dataset with a threshold choice of K=6. In other words, the composite 
descriptor was built by discovering patterns that involved a minimum of 5 non-wildcard 
positions in any rolling window that spans 10 positions and begins and ends with a literal, 
i.e., a relatively high degree of local similarity (50% or higher). Those patterns whose 
estimated log-probability was equal to -30.0 or less were selected, and this generated a 
composite descriptor that comprised 2,260 patterns. 

First, a corresponding DFA (deterministic finite automaton, which will 
only recognize instances of the composite descriptor patterns in a query sequence and 
which performs method step 260) was used to search a randomized version, 
RAND-Swiss-Prot, of Swiss-Prot (Release 38.0) that was obtained by applying a 
randomly chosen permutation to the amino acids of each of the valid sequences. Both the 
composition and lengths of individual sequences were maintained by this operation. The 
global counter for each randomized sequence was derived by summing up the local 
counters from each sequence region that received non-zero support. The sequences were 
then sorted in order of decreasing global-counter value. Twenty-seven (27) randomized 
sequences received non-zero support, with global counter values that ranged between 1 
and 2 inclusive. Thres_rand was thus set to 3, and the DFA was subsequently used to search 
the actual Swiss-Prot database. Of the 69 sequences that received non-zero support, only 
16 met or exceeded the predefined threshold. The support values for the 16 sequences were: 
EF1G_HUMAN 861, EF1G_RABIT 846, EF1G_XENLA 791, EF1H_XENLA 765, 
EF1G_ARTSA 349, EF1G_CAEEL 228, EF1G_YEAST 110, EF1G_SCHPO 110, 
EF1G_PIG 96, EF1H_YEAST 94, EF1G_TRYCR 88, SYV_FUGRU 7, GTT1_RAT 5, 
GTT1_MOUSE 5, SYEP_HUMAN 3 and GTH4_MAIZE 3. 
Note that the 5 hits (SYV_FUGRU, GTT1_RAT, GTT1_MOUSE, 
SYEP_HUMAN and GTH4_MAIZE) are clearly separated from the 11 top-scoring 
sequences. They did, however, obtain scores that met or exceeded the threshold and thus 
were studied in more detail. In all 5 cases, one or more sizeable regions that were shared 
with one or more members of the PS50040 collection were discovered. The Clustal-W 
alignment of EF1G_XENLA and the N-terminus of SYV_FUGRU, a valyl-tRNA 
synthetase from Fugu rubripes, is shown in Table 1 below and reveals a strong 
similarity. As can be seen, the similarity between these two sequences is quite 
extended, and the Clustal-W score for the shown alignment equaled 462. 

Similar shared regions are present in GTT1_RAT and GTT1_MOUSE (a 
glutathione S-transferase 5 from Rattus norvegicus and a glutathione S-transferase theta 1 
from Mus musculus, respectively), SYEP_HUMAN (a multifunctional aminoacyl-tRNA 
synthetase from Homo sapiens) and GTH4_MAIZE (a glutathione S-transferase IV 
from Zea mays). The Clustal-W alignments for these cases are shown in Tables 2 
through 4 below. Table 2 shows a Clustal-W alignment showing a substantial similarity 
between GTT1_RAT, GTT1_MOUSE and EF1G_ARTSA. The Clustal-W score is 1577. 
Table 3 shows a Clustal-W alignment between a fragment from EF1G_CAEEL (a.a. 100 
through 243) and a fragment from SYEP_HUMAN (a.a. 1 through 180) showing a shared 
region. The Clustal-W score for this alignment is 74. Table 4 shows a Clustal-W 
alignment showing a strong similarity between EF1G_RABIT and GTH4_MAIZE. The 
Clustal-W score is 215. 

It should be noted that a search of MEDLINE has indicated that, with the 
exception of the similarity between the EF1G family and the valyl-tRNA synthetase from 
Fugu rubripes, none of the other similarities shown here has been reported in the literature. 

In summary, the composite descriptor has correctly picked out the 
members of PS50040 from the contents of Swiss-Prot and has also identified several 
substantial similarities with other sequences in the database. 

Table 1 

EF1G_XENLA       ^GGTLYTYPDTORAYKPLIAAQYSGFPIia/ASSAPEFQFGVTNKTPEFLKKFPLGKVPA 
SYV_FUGRU_piece  MA- -TLYVSP HLDDFR3LLALVAAEY 
                 * * * * * * * * * . 

EF1G_XENLA       FEGKDGFCLFESSAIAHYVGNDELRfiTTRLHQAQVIQWVSFSDSHIVPPASAWVFPTLGI 
SYV_FUGRU_piece  C GNAK^ QSQVWQWLSFADNELTPVSCAWFPLMGM 
                 * : * * **:**:*..:.* :.* ** * : * : 

EF1G_XENLA       MQYNKQATEQAKEGIKTVL^LDSHLQTRTFLVGERITLADITVTCSLLWLYKQVLEPSF 
SYV_FUGRU_piece  TGLDKKIQQNSRVELMR^KVLDQALEPRTFLVGESITLADMAVAMAVLLPFKYVLEPSD 
                 :*: :::: : * * ***. * : . ******* *****..*. ..* .* ***** 

EF1G_XENLA       RQPFGOTTRWFVTGTOQPEFRAVLGEVKLCDKMAQFDAKKFAEMQPKKETPKKEKPAKEP 
SYV_FUGRU_piece  RNVLMm/TRWFTT&INQPEFLKVLGKISLCEKMVPVTAKTSTEEAAAVH- PDAAALNGPP 
                 * . : ***** * **.***** ***.. **.** ** . * * 

EF1G_XENLA       KKEKEEKKKAAPTPAPAPEDDLDESEKALAAEPKSKDPYAHLP-KSSFIMDEFKRKYSNE 
SYV_FUGRU_piece  KTEAQLKKEAKKREKLEKFQQKKEMEAKKKMQPVAEKKAKPEKRELGVITYDIPTPSGEK 
                 * . * * : * ::.** :*::. :..* 

EF1G_XENLA       DTLTVALPYFW-EHFDKEGWSIWYAEY-KFPEELTQAFMSCNLITGMFQR-LDKLRKTGF 
SYV_FUGRU_piece  KDWSPLPDSYSPQYVEAAWYPWWEKQGFFKPEFGRKSIGEQNPRGIFMMCIPPPNVTGS 
                 ** ::::.**:: **:::.: *:* : . ** 

EF1G_XENLA       ASVI LFGTNNNSS I SGVWV - FRGQDLAFTLSED WQIDYESYNWRKLDSGSEEC- - 
SYV_FUGRU_piece  LHLGHALTNAIQDTLTRWHRMRGETTLWNPGCDHAGIATQVWEKKLMREKGTSRHDLGR 
                 : ** * : ** . * * . * *:.:.. 

EF1G_XENLA       KTLVKEYFAWEGE FKNVGKPFNQG- KI FK 
SYV_FUGRU_piece  EKFIEEVWKWKNEKGDRIYHQLKKLGSSLDWDRACFTMDPKLSYAVQEAFIRMHDEGVIY 
                 :.:::*:*:.* :*::* * 

Table 2 

GTT1_MOUSE  -VLELYLDLLSQPCRAIYIFAKKNNIPFQMHTVELRKGEHLSD/FARVNPMKKVPAMM-D 
GTT1_RAT    -VLELYLDLLSQPCRAIYIFAKKNNIPFQMHTVELRKGEHLS^AFAQVNPMKKVPAMK-D 
EF1G_ARTSA  VAGKLYTYPENFRAFKALIAAQYSGAKLEIAKSFVFGETNK^AFLKSFPLGKVPAFESA 
            * . * * * * . 

GTT1_MOUSE  GGFTLCESVAILLYLAHK------------------------YKVPDHWYPQDLQARARV 
GTT1_RAT    GGFTLCESVAILLYLAHK------------------------YKVPDHWYPQDLQARARV 
EF1G_ARTSA  DGHCIAESNAIAYYVANETLRGSSDLEKAQIIQWMtFADTEILPASCTWVFPVLGIMQFN 
            . * * * * * . * . 

GTT1_MOUSE  DEYLAWQHTGLRRSCLRALWHKVMFPVFLGEQfpPETLAATLAELDVNLQVLEDKFLQDK 
GTT1_RAT    DEYLAWQHTTLRRSCLRTLWHKVMFPVFLGEQIRPEMLAATLADLDVNVQVLEDQFLQDK 
EF1G_ARTSA  KQATARAKEDIDKALQALDDHLLTRTYLVQERITLADIWTCTLLHLYQHVLDEAFRKSY 

GTT1_MOUSE  DFLVGPHISLADLVAITELMHPVGGGfSPVFEGHPRLAAWYQRVEAAVGKDLFREAHEVIL 
GTT1_RAT    DFLVGPHISLADWAITELMHPVGGfSCPVFEGRPRLAAWYRRVEAAVGKDLFLEAHEVIL 
EF1G_ARTSA  VNTNRWFITLINQKQVKAVIGDFKLCEKAGEFDP KKYAEFQAAIGSGEKKKTEKAPK 

GTT1_MOUSE  KVKDCPPADLIIKQKLMPRVLTMIQ 
GTT1_RAT    KVRDCPPADPVIKQKLMPRyLTMIQ 
EF1G_ARTSA  AVKAKPEKKEVPKKEQEEE^DAAEEALAAEPKSKDPFDEMPKGTFMMDDFKRFYSNNEET 

GTT1_MOUSE  
GTT1_RAT    
EF1G_ARTSA  KSIPYFWEKFDKENYSIWYSEYKYQDELAKVYMSCNLITGMFQRIEKMRKQAFASVCVFG 

GTT1_MOUSE  
GTT1_RAT    
EF1G_ARTSA  EDNDSSL6GIWVWRGQDLAFKLSPDWQIDYESYDWKKLDPDAQETKDLVTQYFTWTGTDK 

GTT1_MOUSE  
GTT1_RAT    
EF1G_ARTSA  QjSRKFNQGKIFK 

Table 3 

EF1G_CAEEL (100-243)  NFD KKTVEQYK- -NE^NGQLQVLDRVLVKKTYLVGERLSLADVSVALDLLPAF 
SYEP_HUMAN (1-180)    MEHTEIDHWLEFSATKLSSCDSFTSTINELNHCLSLRTYLVGNSLSLADLCVWATLKGNA 
                      . . . . 

EF1G_CAEEL (100-243)  QYVLDANARKSIVNV'HWFRTVVNQPAVKEV- -LGEVSLASS- VA-QFNQ- -AKFTELS- 
SYEP_HUMAN (1-180)    AWQEQLKQKKAPVH^feWFGFLEAQQAFQSVGTKWDVSTTKARVAPEKKQDVGKFVELPG 
                      : : : :* : ** *** . * *. : .* : ** ... ** . .* ** 

EF1G_CAEEL (100-243)  - - - AKVAKS WaEKPKKEAKPAAAA- -AQP E DD-EPKEEKS-KDP- - 
SYEP_HUMAN (1-180)    AEMGKVTVR^PPEASGYLHIGHAKAALLNQHYQVNFKGKLIMRFDDTNPEKEKEDFEKVI 
                      ** * . . * ** * . **.*..**. 

Table 4 

EF1G_RABIT  MAAGTLYTYPENWRAFKALIAAQYSGAQy^VLSAPPHFHFGQTNRTPEFLRKFPAGKVPA 
GTH4_MAIZE  -ATPAVKVYGWAISPFVSRALLALEEAGVDYELVPMSRQDGD-HRRPEHLARNPFGKVPV 
            *:::.* . * : .* * .* : * : . * * * * . 

EF1G_RABIT  FEGDDGFCVFESNAIAYYVS--/-NEELRGSTPEAAAQVVQWVSFADSDIVPPAST 
GTH4_MAIZE  

The Second Example: G protein-coupled receptors 

The family of G protein-coupled receptors has a long evolutionary history 
and is of particular importance for signal transduction in all eukaryotes. Spanning the 
lipid bilayer of the plasma membrane with seven helices, they bind and form signal 
transducing couples that are at the center of many key processes such as visual excitation, 
olfaction, histamine secretion in allergic reactions, and chemotaxis. G protein-coupled 
receptors form a very diverse family and extensive studies have shown that single 
descriptor approaches do not suffice to characterize the family's members. 

Despite considerable efforts, very few membrane proteins have yielded 
high-resolution X-ray crystallographic data; this has led to increased use of electron 
microscopy approaches. The first such data were in fact obtained for bacteriorhodopsin, 
the bacterial analogue of rhodopsin, where a 3 Å electron-microscopy reconstruction 
directly established the presence of the seven transmembrane helices. The significant 
sequence similarity that the members of this family exhibit indicates that they ought to 
have the same topology. 

In order to demonstrate the power of the present invention and its ability to 
generalize, the experiment began with the contents of the GPCRDB as they existed in 
May 1998. Note that the hypothetical proteins from Caenorhabditis elegans were excluded 
from this collection, since the intent was to carry out GPCR discovery in this genome. 
The bacterial analogues of rhodopsin as well as all listed G-proteins were also excluded. 
What was left was a total of 1,019 GPCR entries, of which 862 were complete sequences 
and 157 were fragments. This set was intersected with an older release of Swiss-Prot 
(Release 35.0 from November 1997), and the intersection of the two databases was found 
to comprise a total of 804 sequences and fragments. Starting with data that were almost 
two years old was intentional, since it was important to show the ability of the 
composite descriptors to generalize and identify additional candidate sequences in the 
much larger databases of today. 



The collection of 804 GPCR sequences and fragments contained several 
classes (e.g., rhodopsin-like, secretin-like, pheromone, etc.) of proteins. In turn, each of 
these classes comprised several representatives. Instead of selecting representatives from 
each of the identified classes, the order of the sequences in this set of 804 members was 
randomized. Note that the contents of the sequences themselves remained unchanged; 
only their order of appearance was modified. For example, the 613-th sequence was now 
listed 4-th, the 11-th sequence now appeared in the 45-th position, and so on. 
Subsequently, a training set T was formed by collecting the sequences and fragments 
listed in the first 80 positions, arguably a very small set if one considers the diversity of 
the GPCR family. Essentially, slightly less than 1/10-th of the available dataset was 
randomly sub-selected for the purposes of building the composite descriptor. Table 5 
below contains a listing of the labels of the 80 sequences in this training set. Table 5 
shows the Swiss-Prot labels of the 80 sequences in the training set for the G 
protein-coupled receptor experiment. The labels are listed in the order they were selected 
and they correspond to both sequences and sequence fragments. 

Table 5 



1 through 20 | 21 through 40 | 41 through 60 | 61 through 80 

EBI2_HUMAN | OPSD_CORAU | ACM2_HUMAN | PACR_RAT 
ML1X_HUMAN | ACM4_XENLA | VIPR_MELGA | CRFR_CHICK 
ACM3_CHICK | G10D_MOUSE | V1BR_HUMAN | OLF9_RAT 
P2YR_MOUSE | OLF4_CHICK | MGR8_HUMAN | ACM4_MOUSE 
OAR_DROME | ACM3_PIG | SSR4_RAT | NY4R_MOUSE 
AA1R_CHICK | 5H1A_MOUSE | HH1R_MOUSE | 5HTB_DROME 
MAM2_SCHPO | MSHR_BOVIN | NK2R_RAT | GU38_RAT 
SCRC_RAT | OLF5_RAT | MSHR_HUMAN | PF2R_BOVIN 
PAFR_CAVPO | GU03_RAT | A2AA_PIG | OPSB_GECGE 
ACM3_RAT | P2YR_BOVIN | B3AR_BOVIN | AA3R_HUMAN 
OLIJ_HUMAN | GPCR_LYMST | OPSB_HUMAN | MC3R_MOUSE 
GPRJ_MOUSE | FMLR_RABIT | GPRO_HUMAN | BAR2_SCHCO 
D4DR_MOUSE | B1AR_HUMAN | 5H2A_CRIGR | CRFR_HUMAN 
ML1C_CHICK | D3DR_RAT | PER4_MOUSE | MC4R_RAT 
PER2_RAT | PF2R_MOUSE | OPSD_CATBO | OPSB_ANOCA 
OPRX_PIG | PERI_RAT | ACM1_RAT | IL8A_RAT 
AA1R_HUMAN | GRPR_MOUSE | OPS2_SCHGR | AA1R_CAVPO 
DOP1_DROME | GRFR_PIG | GRHR_HUMAN | AG2R_HUMAN 
OXYR_RAT | NK1R_RANCA | NK1R_RAT | GPRM_HUMAN 
B3AR_MOUSE | OLF1_HUMAN | EDG2_SHEEP | CASR_HUMAN 
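
The selection of the training set listed in Table 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function name `select_training_set`, the `seed` parameter, and the placeholder labels are hypothetical and were introduced only for this example.

```python
import random

def select_training_set(labels, size=80, seed=0):
    """Shuffle the order of the labels and keep the first `size` of them.

    Only the order of appearance changes; the sequences themselves are
    untouched. `seed` is a hypothetical parameter added for reproducibility.
    """
    shuffled = list(labels)                # copy, so the input order is preserved
    random.Random(seed).shuffle(shuffled)  # randomize the order of appearance
    return shuffled[:size]

# Placeholder labels standing in for the 804 GPCR sequences and fragments.
all_labels = [f"SEQ{i:03d}" for i in range(1, 805)]
training_set = select_training_set(all_labels)
```

Because the subset is simply the head of a shuffled list, no class of receptors is deliberately represented or excluded, which matches the stated intent of not hand-picking representatives.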



As in the previous example, the patterns were discovered assuming the 
equivalence classes {A, G}, C, {D, E}, {F, Y}, H, {I, L, M, V}, {K, R}, {N, Q}, P, {S, 
T}, W. The Teiresias parameters were set to L=5, W=10, whereas the successive 
threshold choices were K=80, K=16 and K=3. The goal was to discover patterns that 
involved at least 5 non-wild cards in any rolling window that spans 10 positions and 
begins/ends with a literal, which is a relatively high degree of local similarity (i.e., 50% 
or higher). Those patterns whose estimated log-probability was equal to -30.0 or less 
were selected, and this generated a composite descriptor that comprised 1,703 patterns. 
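
The equivalence classes and the L/W density requirement can be illustrated with a short sketch. It is not the Teiresias implementation: the helper names, the use of "." as the wild-card symbol, and the choice of the first member of each class as its representative are assumptions made here, and the density test is one reading of the requirement (every run of L consecutive literals must fit in a window of at most W positions).

```python
# Equivalence classes quoted in the text; the first member of each class is
# used as its representative (an arbitrary choice made for this sketch).
CLASSES = ["AG", "C", "DE", "FY", "H", "ILMV", "KR", "NQ", "P", "ST", "W"]
REPRESENTATIVE = {aa: cls[0] for cls in CLASSES for aa in cls}

def reduce_sequence(seq):
    """Map every residue to the representative of its equivalence class."""
    return "".join(REPRESENTATIVE.get(aa, aa) for aa in seq)

def satisfies_density(pattern, L=5, W=10, wildcard="."):
    """True if the pattern begins and ends with a literal and every run of
    L consecutive literals spans at most W positions."""
    literals = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(literals) < L or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    return all(literals[i + L - 1] - literals[i] + 1 <= W
               for i in range(len(literals) - L + 1))
```

With L=5 and W=10, a window that passes this test has at least half of its positions occupied by literals, which is the 50% figure mentioned above.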

First, the corresponding DFA (deterministic finite automaton, which will 
only recognize instances of the composite descriptor patterns in a query sequence and 
which performs method step 260) was used to search a randomized version 
RAND-Swiss-Prot of Swiss-Prot (Release 38.0) (see also relevant discussion in the 
PS50040 example). The sequence regions with non-zero local counters were identified 
and the maximum counter values from each such region were summed up; the sum-total 
was attached to the sequence label and the sequences were sorted in order of decreasing 
sum value. A total of 1,564 sequence fragments from RAND-Swiss-Prot received 
non-zero support and the actual histogram of these values is shown in Fig. 3. Of those 
1,564 fragments, 1,548 received a support value that was less than 9. Thus Thres_rand = 10 
was selected; this threshold choice corresponded to the 99-th percentile. 
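
The scoring rule described above (sum of the per-region maxima of the local counters) and the percentile check on the randomized database can be sketched as follows. The sketch assumes that the local counters have already been produced by the DFA and are available as a list of non-negative integers, one per sequence position; the function names are hypothetical.

```python
def sequence_support(local_counters):
    """Sum the maximum counter value of each maximal run of non-zero counters."""
    total, region_max = 0, 0
    for value in local_counters:
        if value > 0:
            region_max = max(region_max, value)
        else:
            total += region_max   # close the current region (adds 0 if none is open)
            region_max = 0
    return total + region_max     # account for a region that reaches the end

def percent_below(scores, threshold):
    """Percentage of scores that are strictly below the threshold."""
    return 100.0 * sum(s < threshold for s in scores) / len(scores)
```

For instance, the counters `[0, 1, 3, 2, 0, 0, 4, 4, 0]` contain two regions with maxima 3 and 4, so the support attached to that sequence would be 7. Applying `percent_below` to the supports of the randomized sequences is one way to confirm that a candidate threshold sits near the intended percentile.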

Subsequently, the same DFA was used to search the actual Swiss-Prot 
database, testing each of its 80,236 sequences for membership in the G protein-coupled 
receptor family. Sum values were attached to each sequence as above and only the 947 
sequences from Swiss-Prot that received support greater than or equal to Thres_rand = 10 
were kept. 

In order to determine the quality of the composite descriptor and 
the number of true and false positives that the descriptor gives rise to, the Swiss-Prot 
annotation (keyword "KW" lines) was used for each of these 947 sequences. Of these 
retrieved sequences, 928 are actually listed as G protein-coupled receptors, 10 are 
eukaryotic transmembrane proteins (SUR7_YEAST, C561_HUMAN, YIPC_YEAST, 
NU4M_APIME, SCG2_XENLA, GTR2_LEIDO, GARP_HUMAN, CIN6_HUMAN, 
CIN3_RAT, PLSC_COCNU), 2 are hypothetical eukaryotic transmembrane proteins 
(YJZ3_YEAST, YMJC_CAEEL), 2 are hypothetical proteins (YKY4_YEAST, 
YCX7_YEAST), and finally 5 are bacterial false positives (PIP_BACCO, 
VIRR_AGRT6, YQGP_BACSU, HBD_CLOTS, PROA_HAEIN). 
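
The tallies quoted above can be checked with a few lines of arithmetic. The category names below paraphrase the text, and treating only the five bacterial hits as false positives follows the text's own count; nothing else is assumed.

```python
# Annotation tallies quoted in the text for the 947 retrieved sequences.
tally = {
    "annotated G protein-coupled receptors": 928,
    "eukaryotic transmembrane proteins": 10,
    "hypothetical eukaryotic transmembrane proteins": 2,
    "hypothetical proteins": 2,
    "bacterial false positives": 5,
}

total = sum(tally.values())
annotated_fraction = tally["annotated G protein-coupled receptors"] / total
false_positive_fraction = tally["bacterial false positives"] / total

print(f"total retrieved: {total}")
print(f"annotated as GPCR: {annotated_fraction:.1%}")
print(f"counted as false positives: {false_positive_fraction:.1%}")
```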

This is a very notable result, given the comparatively small amount of 
information that is captured by the 80-sequence input set and the diversity of the G 
protein-coupled receptor family. Table 6 below contains a listing of the labels of the 947 
Swiss-Prot sequences whose support exceeded threshold; the labels are listed in order of 
decreasing value of the global counter that was associated with the corresponding 
sequence, and the 5 false positives are shown in boldface. Table 6 shows the labels of the 
947 sequences from Swiss-Prot Release 38.0 that received support above threshold in the 
G protein-coupled receptor example. The 5 false positives are shown with an "(FP)." 



Table 6 

1 through 50 | 51 through 100 | 101 through 150 | 151 through 200 | 201 through 250 



OAR DROME 


B2AR CANFA 


A2AC DIDMA 


5H1A RAT 


MSHR CEREL 


B3AR MOUSE 


AA1R RAT 


ACM4 XENLA 


5H1A MOUSE 


MSHR CAPHI 


B3AR CAVPO 


AA1R RABIT 


ACM1 MOUSE 


5H1A HUMAN 


MSHR CAPCA 


B3AR RAT 


AA1R HUMAN 


AA3R SHEEP 


SCRC RABIT 


MSHR ALCAA 


OAR HELVI 


AA1R CANFA 


AA2B CHICK 


A1AA CANFA 


VI PR CARAU 


OAR BOMMO 


AA1R CHICK 


ACM4 MOUSE 


MC5R HUMAN 


VI PR HUMAN 


OAR2 LOCMI 


B1AR XENLA 


ACM4 HUMAN 


AA3R CANFA 


HH1R BOVIN 


OAR1 LOCMI 


NK1R RANCA 


5H4 CAVPO 


5H4 RAT 


5HT LYMST 


GREC BALAM 


B1AR SHEEP 


AA2B HUMAN 


NK2R CAVPO 


HH1R RAT 


B1AR RAT 


AA1R BOVIN 


NK1R RAT 


SSR1 RAT 


HH1R HUMAN 


B1AR MOUSE 


A2AA CAVPO 


NK1R MOUSE 


SSR1 MOUSE 


HH1R CAVPO 


B1AR PIG 


A2AA HUMAN 


MC5R RAT 


SSR1 HUMAN 


SSR3 HUMAN 


B1AR MACMU 


5H2B HUMAN 


MC5R MOUSE 


NK4R HUMAN 


5HT BOMMO 


B1AR HUMAN 


AA1R CAVPO 


MC5R BOVIN 


OPRK RAT 


NK2R HUMAN 


B1AR CANFA 


ACM3 PIG 


MC5R SHEEP 


OPRK MOUSE 


NK3R RAT 


B1AR MELGA 


ACM3 HUMAN 


AA3R RAT 


OPRK HUMAN 


NK3R MOUSE 


B4AR MELGA 


ACM 3 CHICK 


DOP1 DROME 


OPRK CAVPO 


NK3R HUMAN 


B3AR BOVIN 


ACM 3 BOVIN 


AA2B RAT 


5HTA DROME 


SSR3 RAT 


B3AR CANFA 


A2AC_HUMAN 


AA2B MOUSE 


AA3R RABIT 


SSR3 MOUSE 


PACR RAT 


5H7 RAT 


A1AB RAT 


A2AB RABIT 


MSHR MOUSE 


B3AR MACMU 


5H7 MOUSE 


A1AB MOUSE 


A2AB MACPR 


GRFR PIG 


B3AR HUMAN 


5H7 HUMAN 


A1AB MESAU 


MC3R MOUSE 


NK2R RABIT 


B2AR MOUSE 


5H7 CAVPO 


A1AB HUMAN 


MC3R HUMAN 


NK2R BOVIN 


SCRC RAT 


A2AD HUMAN 


D3DR HUMAN 


5H1B FUGRU 


VIPR PIG 


B2AR RAT 


A2AC RAT 


D3DR CERAE 


ACM4 RAT 


5H1A FUGRU 


B2AR MESAU 


A2AC MOUSE 


NK1R HUMAN 


A2AB TALEU 


5HTB DROME 


B2AR BOVIN 


A2AC CAVPO 


NK1R CAVPO 


A2AB PROHA 


VI PS HUMAN 


B2AR PIG 


VI PR RAT 


SSR4 RAT 


A2AB ORYAF 


MC4R RAT 


5H2A PIG 


A2AA RAT 


SSR4 MOUSE 


A2AB HORSE 


MC4R HUMAN 


5H2A MACMU 


A2AA PIG 


SSR4 HUMAN 


A2AB ERIEU 


A1AB CANFA 


5H2A HUMAN 


A2AA MOUSE 


D2D1 XENLA 


A2AB ELEMA 


NK2R RAT 


5H2A CRIGR 


ACM 3 RAT 


AA3R HUMAN 


A2AB DUGDU 


NK2R MOUSE 


5H2A RAT 


A2AR LABOS 


5H7 XENLA 


A1AD HUMAN 


NK2R MESAU 


5H2A MOUSE 


ACM4 CHICK 


D3DR RAT 


A2AB DIDMA 


D2D2 XENLA 


PACR MOUSE 


AA2A HUMAN 


D3DR MOUSE 


A1AD RAT 


MSHR HUMAN 


5H2C RAT 


AA2A CANFA 


A1AA ORYLA 


A1AD RABIT 


5H1F RAT 


5H2C HUMAN 


AA2A RAT 


A2AB CAVPO 


A1AD MOUSE 


5H1F MOUSE 


5H2C MOUSE 


AA2A MOUSE 


D2DR MOUSE 


OPRM RAT 


5H1F CAVPO 


PACR BOVIN 


A1AA RAT 


D2DR HUMAN 


OPRM PIG 


5H1F HUMAN 


5H2B MOUSE 


A1AA RABIT 


D2DR FUGRU 


OPRM MOUSE 


DOP2 DROME 


PACR HUMAN 


A1AA HUMAN 


D2DR CERAE 


OPRM HUMAN 


HH2R CAVPO 






T)A HT5 DAT 


A1AA BOVIN 


D2DR BOVIN 


OPRM BOVIN 


HH2R CAN FA 


D4DR MOUSE 


ACM 2 RAT 


A2AB RAT 




HH1R MOUSE 


SCRC HUMAN 


ACM 2 PIG 


A2AB MOUSE 


AA2A CAVPO 


GASR PRANA 


D4DR HUMAN 


ACM2 HUMAN 


A2AB HUMAN 


MSHR BOVIN 


GASR HUMAN 


VI PR MELGA 


ACM2 CHICK 


5H4 MOUSE 


MSHR VULVU 


A2AB BOVIN 


5H2B RAT 


ACM 5 RAT 


ACM1 RAT 


MSHR SHEEP 


5H1B RABIT 


A2AR CARAU 


ACM 5 MACMU 


ACM1 PIG 


MSHR RANTA 


MSHR HORSE 


B2AR MACMU 


ACM 5 HUMAN 


ACM1 MACMU 


MSHR OVIMO 


GASR RABIT 


B2AR HUMAN 


5HT1 DROME 


ACM1 HUMAN 


MSHR DAMDA 


GASR MOUSE 



251 through 300 | 301 through 350 | 351 through 400 | 401 through 450 | 451 through 500 



GASR CAN FA 


IL8B GORGO 


GPRL HUMAN 


AG2T RAT 


BRS4 BOMOR 


GASR BOVIN 


IL8B BOVIN 


MGR8 HUMAN 


ML1A SHEEP 


V1BR_RAT 


5H1D RAT 


5H5A RAT 


GPCR LYMST 


ML1A PHOSU 


BRS3 SHEEP 


5H1D MOUSE 


5H5A MOUSE 


A2AB AMBHO 


ML1A HUMAN 


BRS3 MOUSE 


5 HID FUGRU 


5H6 HUMAN 


BRB2 HUMAN 


MGR8 RAT 


BRS3 HUMAN 


5H1D CAVPO 


VI PS RAT 


OLF5 RAT 


IL8A GORGO 


OPSG CHICK 


5H1B SPAEH 


VI PS MOUSE 


5H1E HUMAN 


GRPR HUMAN 


MGR8 MOUSE 


5H1B RAT 


VI PR MOUSE 


5H1D RABIT 


GPRC RAT 


OPSB ANOCA 


5H1B MOUSE 


CCKR RAT 


OPS1 PATYE 


GPRC MOUSE 


GHSR RAT 


GASR RAT 


CCKR HUMAN 


CKR1 MACMU 


GPRC HUMAN 


GHSR PIG 


YYI3 CAE EL 


CCKR CAVPO 


CKR1 HUMAN 


GPR3 MOUSE 


GHSR HUMAN 


MSHR CHICK 


IL8B PANTR 


BRB2 MOUSE 


GPR3 HUMAN 


FML1 PANTR 


5H1B CRIGR 


IL8B MACMU 


DADR XENLA 


EDG2 SHEEP 


SSRL FUGRU 


5H1B CAVPO 


IL8B HUMAN 


DADR PIG 


EDG2 MOUSE 


CASR RAT 


SSR2 RAT 


IL8A PANTR 


DADR HUMAN 


EDG2 HUMAN 


5HT2 APLCA 


SSR2 MOUSE 


IL8A HUMAN 


DADR DIDMA 


BLR1 MOUSE 


OPSD ALLMI 


B3AR PIG 


AG2R MELGA 


D1DR CARAU 


OLF4 RAT 


CCR4 RAT 


SSR2 HUMAN 


AG2R_CHICK 


5HT1 APLCA 


GRPR_RAT 


CCR4 PAPAN 


5H1B_HUMAN 


OLF0 RAT 


5H2A CANFA 


GRPR MOUSE 


CCR4 MOUSE 


OPRX RAT 


nDDP TTT Tiyr tv -kt 


OLF9 RAT 


GALS HUMAN 


CCR4 MACMU 


OPRX MOUSE 


CCKR XENLA 


OLF1 RAT 


FMLR MACMU 


CCR4 MAC FA 


OPRX CAVPO 


AG2S RAT 


FML1 PONPY 


CKR3 MACMU 


CCR4 HUMAN 


HH2R HUMAN 


AG2S MOUSE 


DCDR XENLA 


AG2R RABIT 


CCR4 FELCA 


5H1D HUMAN 


AG2S HUMAN 


DADR RAT 


EDG2 BOVIN 


CCR4 CERTO 


5H1D CANFA 


AG2R RAT 


FML1 MACMU 


GALS RAT 


CCR4 BOVIN 


HH2R RAT 


AG2R PIG 


FML1 HUMAN 


GALS MOUSE 


APJ HUMAN 


OPRX PIG 


AG2R MOUSE 


FML1 GORGO 


G10D RAT 


OPSD RANTE 


OPRX HUMAN 


AG2R MERUN 


GPR1 RAT 


G10D MOUSE 


5H1B PIG 


SSR2 PIG 


AG2R HUMAN 


BRB2 RAT 


OPSD RAT 


CKR8 MOUSE 


SSR2 BOVIN 


AG2R CANFA 


GALR RAT 


CKR3 MOUSE 


CKR1 MOUSE 


TLR2 DROME 


AG2R BOVIN 


GALR MOUSE 


V1BR HUMAN 


OPSP CHICK 


HH2R MOUSE 


5H6 RAT 


GALR HUMAN 


FML1 MOUSE 


OPSD OCTDO 


5H1B DIDMA 


GPRF MACNE 


D5DR FUGRU 


OPSD CRIGR 


OPRM CAVPO 


A2AB ECHTE 


GPRF CERAE 


D1DR OREMO 


BRS3 CAVPO 


OPSX MOUSE 


5HT HELVI 


GP3 8 HUMAN 


FMLR MOUSE 


ML1A CHICK 


OPSX HUMAN 


5H1D PIG 


5H2A CAVPO 


BRB2 RABIT 


TLR1 DROME 


C3AR HUMAN 


OPRD RAT 


CCKR MOUSE 


SSR5 HUMAN 


OPSD TRIMA 


OPSD RAJER 


OPRD MOUSE 


YDBM CAE EL 


DBDR RAT 


OPSD SHEEP 


ML IX HUMAN 


OPRD HUMAN 


NYR DROME 


DBDR HUMAN 


OPSD RABIT 


AG2S XENLA 


GALT RAT 


OLFD CANFA 


5H2B PIG 


OPSD PIG 


OLF6 RAT 
HAT.T MPlUC!!? 

VJ.tt.lJ X i"i.\J Udd 


bKfK MOUSE 


ML1A MOUSE 


OPSD PHOVI 


ML1B HUMAN 




PD PD T TT TTVjf 7\ XT 

LjKrK HUMAN 


MC4R MOUSE 


OPSD PHOGR 


OLFJ HUMAN 


GPRA HUMAN 


SSR5 RAT 


MAM2 SCHPO 




ML1B CHICK 


IL8B RAT 


SSR5 MOUSE 


FMLR RABIT 


OPSD MOUSE 


GPRJ HUMAN 


IL8B MOUSE 


OLFE HUMAN 


D1DR FUGRU 


OPSD_MACFA 


GPR4 PIG 


IL8A RABIT 


OPRD PIG 


IL8A RAT 


OPSD HUMAN 


GPR4 HUMAN 


GRFR RAT 


IL8B RABIT 


BLR1 RAT 


OPSD CANFA 


CKR8 HUMAN 


5H5B RAT 


0LF4 CANFA 


DBDR XENLA 


OPSD BOVIN 


GPRJ MOUSE 


5H5B MOUSE 


GU2 7 RAT 


CASR HUMAN 


GALT HUMAN 


GPR8 HUMAN 


GPRA RAT 


OLFI HUMAN 


BLR1 HUMAN 


0AR2 LYMST 


OPSD XENLA 



501 through 550 | 551 through 600 | 601 through 650 | 651 through 700 | 701 through 750 



OPSD TURTR 


NY2R PIG 


OPS I AST FA 


GC96 HUMAN 


OPSP PETMA 


OPSD RANPI 


NY2R HUMAN 


OPSG ORYLA 


YTJ5 CAEEL 


OPSD ZEUFA 


OPSD RANCA 


NY2R BOVIN 


OPSG CARAU 


OLF8 RAT 


OPSD COTBO 


OPSD MESBI 


NY2R MOUSE 


OPSD GAMAF 


NY1R XENLA 


OPSD ABYKO 


OPSD GLOME 


CKR6 HUMAN 


FML2 MACMU 


GPRS HUMAN 


OLF2 CHICK 


OPSD DELDE 


C3AR MOUSE 


C5AR CANFA 


GIPR RAT 


DADR BOVIN 


OPSD BUFMA 


VQ3L CAPVK 


TRFR CHICK 


C5AR RAT 


OLF5 CHICK 


OPSD BUFBU 


OXYR PIG 


THRR RAT 


BONZ HUMAN 


OLF3 CHICK 


OPSD AMBTI 


OXYR MOUSE 


THRR PAPHA 


YR4 2 CAEEL 


OLFI CHICK 


ML1C CHICK 


OXYR MACMU 


THRR MOUSE 


OPSD NEOAU 


FSHR PIG 


CASR BOVIN 


OXYR BOVIN 


THRR HUMAN 


CKR4 HUMAN 


FSHR MAC FA 


TRFR SHEEP 


EDG1 RAT 


THRR CRILO 


V1AR HUMAN 


FSHR HUMAN 


TRFR RAT 


EDG1 MOUSE 


CCR3 HUMAN 


OPSD SARTI 


FSHR HORSE 


TRFR MOUSE 


EDG1 HUMAN 


GPRO RAT 


OPSD SARSP 


FSHR EQUAS 


TRFR HUMAN 


ACTR HUMAN 


THRR XENLA 


BONZ MACNE 


DBDR BOVIN 


OXYR HUMAN 


OPSD SARPU 


PTRR DIDMA 


BONZ CERAE 


AG22 SHEEP 

£-t ^ iJXlij J_j XT 


OPSD LAMJA 


DADR RABIT 


NTR1 HUMAN 


RDC1 HUMAN 


US 2 8 HCMVA 


OPSD_CHICK 


C5AR GORGO 


5H1E PIG 


PTRR PIG 


PTRR RAT 


ML1C XENLA 


YLD1 CAEEL 


OLF3_CANFA 


PTRR HUMAN 


PTRR MOUSE 


GPRX ORYLA 


FMLR PONPY 


GPRJ RAT 


YR13_CAEEL 


OPSB APIME 


OPS1 SCHGR 


OPSB CONCO 


GPRD HUMAN 


OPSD NEOSA 


OLF7 RAT 


OLF2 RAT 


ML1X MOUSE 


GIPR HUMAN 


OPSD COMDY 


OL1C HUMAN 


ML1X SHEEP 


FSHR SHEEP 


OPSF ANGAN 


OPSD CATBO 


NY6R MOUSE 


CCR4 SHEEP 


FSHR BOVIN 


OPSD TAUBU 


OLF6 MOUSE 


ET1R RAT 


PE2 2 RAT 


ACTR BOVIN 


OPSD BATNI 


OPSD CAMAB 


ET1R PIG 


OPSP COLLI 


PE2 2 MOUSE 


OPSD BATMU 


OLFI HUMAN 


ET1R HUMAN 


GPR6 RAT 


PE22 HUMAN 


P2Y9 HUMAN 


OLFI CANFA 


ET1R BOVIN 


GPR6 HUMAN 


OX2R RAT 


P2Y5 HUMAN 


ETBR RAT 


OPSB CHICK 


ACM1 DROME 


OX2 R HUMAN 


P2Y5 CHICK 


ETBR PIG 


OLF2 HUMAN 


V1AR MOUSE 


5H1B CANFA 


OXYR SHEEP 


ETBR MOUSE 


OPSD LIMPA 


FMLR PANTR 


OGR1 HUMAN 


OPSD NEOAR 


ETBR HUMAN 


OPSD CYPCA 


FMLR HUMAN 


GPRV HUMAN 


NMBR RAT 


ETBR HORSE 


OPSD CARAU 


RDC1 CANFA 


ACTR MOUSE 


NMBR MOUSE 


ETBR COTJA 


OL15 MOUSE 


OPSD ANOCA 


ACTR MESAU 


TDA8 MOUSE 


ETBR CANFA 


GU5 8 RAT 


GPRK HUMAN 


FMLR GORGO 


H218 RAT 


ETBR BOVIN 


GU3 8 RAT 


FML2 PONPY 


OPSB GECGE 


GPRH HUMAN 


OPS2 LIMPO 


GU01 RAT 


FML2 PANTR 


RDC1 MOUSE 


CKR7 MOUSE 


OPS1 LIMPO 


OPSD PARKN 


FML2 HUMAN 


OXYR RAT 


CKR7 HUMAN 


OL7B MOUSE 


OPSD COTIN 


FML2 GORGO 


OPSU BRARE 


AG22 RAT 


OLID HUMAN 


NY6R RABIT 


EBI2 HUMAN 


OPSD SARXA 


AG2 2 MOUSE 


OL1A HUMAN 


OPSD ICTPU 
C5AR PONPY 


ort.rCi v li. 


Ahz z HUMAN 


GU03 RAT 


GPRD RAT 


C5AR PANTR 


npqn QapriT 


OLr 3 HUMAN 


GRHR CLAGA 


GIPR MESAU 


C5AR HUMAN 


OPSD MYRBE 


CKR4 MOUSE 


CKR3 HUMAN 




V1AR_SHEEP 


NTR1 RAT 


OPSD ANGAN 


CKR3 CERAE 


GU45 RAT 


V1AR RAT 


GPRO HUMAN 


NMBR HUMAN 


C5AR_MACMU 


DEZ HUMAN 


OPSP ICTPU 


OPSH CARAU 


OPSD TODPA 


YW01 CAE EL 


OL13 MOUSE 


NY2R CAVPO 


OPSD POERE 


OPSD SEPOF 


0L1G HUMAN 


GPR7 HUMAN 


GPR1 HUMAN 


OPSD ORYLA 


OPSD PROJE 


YT6 6 CAE EL 


0X1R RAT 


AG2R XENLA 


OPSD MYRVI 


OPSD LOLFO 


OLF4 CHICK 


0X1R HUMAN 


OLF3 RAT 


C5AR MOUSE 


OPSD COTGR 


HM74 HUMAN 


BRB1 RABIT 



751 through 800 | 801 through 850 | 851 through 900 | 901 through 947 
















US 2 7 HCMVA 


GRHR MOUSE 


P2YR MOUSE 






NODC RHISM 


GRHR HUMAN 


P2YR HUMAN 




ODCPl T OT OTT 


GP42 HUMAN 


GRHR HORSE 


P2YR BOVIN 




PDDD TTT TKA 7\ \T 

urKr HUMAN 


GP41 HUMAN 


GRHR BOVIN 


OPSB ORYLA 




LivKV MOUSE 


MGR4 RAT 


GPR2 HUMAN 


NU4M API ME 




LrLKb RAT 


LSHR SHEEP 


GLPR MOUSE 


ML1A BOVIN 




f I/"D C MAT TC r> 

LJvKb MOUSE 


LSHR PIG 


GLPR HUMAN 


YCX7 YEAST 




rJAK2 bCHCO 


LSHR HUMAN 


FSHR RAT 


SCG2 XENLA 




DT ID T TT TKA 7\ >T 

rlzR HUMAN 


LSHR CALJA 


YKY4 YEAST 


PIP_BACCO (FP) 




UFbU LOIKh 


GLR MOUSE 


OX2R PIG 


PI2R RAT 




VJvUz bPVKA 


EDGL MOUSE 


OPSV CHICK 


PI2R BOVIN 




PlTT *7 DAT 1 


OPSD POMMI 


OPSB SAIBB 


PAFR RAT 






MGR7 RAT 


OPSB RAT 


P2UR RAT 




LrvKb FAPHA 


MGR7 HUMAN 


OPSB MOUSE 


P2UR MOUSE 




LJvKb PANTR 


CML2 RAT 


OPSB HUMAN 


P2UR HUMAN 




CKKb MACMU 


NY1R RAT 


OPS4 DROPS 


OPSR ORYLA 




CKR5 GORGO 


NY1R PIG 


OL1L HUMAN 


OPSO SALSA 




LKR5 CERTO 


NY1R MOUSE 


OL1B HUMAN 


GTR2 LEIDO 




CKR5 CERAE 


NY1R_HUMAN 


HBD_CLOTS (FP) 


GLHR ANTEL 




CKR2 MOUSE 


NY1R CANFA 


GRHR RAT 


GARP HUMAN 




CKR2 HUMAN 


RTA RAT 


EDG3 HUMAN 


PAFR MOUSE 




OPSB CARAU 


OPSV XENLA 




OPSD APIME 




0PS2 PATYE 


OPSU CARAU 


YIPC YEAST 


MAS RAT 




GP4 3 HUMAN 


OPS1 DROPS 


PTH2 RAT 


MAS MOUSE 




GCRC MOUSE 


OPS1 DROME 


PAFR CAVPO 


MAS HUMAN 




BRB1 HUMAN 


OPS1 CALVI 


OPSB BOVIN 


CIN6 HUMAN 




VC03 SPVKA 


MGR6 RAT 


OPS2 SCHGR 


CIN3 RAT 




PE24 RAT 


GLPR RAT 


OLF6 CHICK 


CB1R RAT 




PE24 RABIT 


ETBR MAC FA 


NY4R RAT 


CB1R MOUSE 




PE24 MOUSE 


TSHR SHEEP 


NY4R MOUSE 


CB1R HUMAN 




PE24 HUMAN 


TSHR BOVIN 


GRHR PIG 


CB1R FELCA 




YYOl CAE EL 


OPSR HORSE 


GPRM HUMAN 


CB1B FUGRU 




YR41 CAEEL 


OPSG ODOVI 


GP3 9 HUMAN 


CB1A FUGRU 




V2R RAT 


LSHR RAT 


PTR2 HUMAN 


YQGP_BACSU (FP) 




V2R PIG 


LSHR MOUSE 


PE21 RAT 


VIRR_AGRT6 (FP) 




V2R HUMAN 


LSHR BOVIN 


PE21 MOUSE 


PLSC COCNU 




V2R BOVIN 


GLR RAT 


PAFR HUMAN 


OPS 6 DROME 




OPS1 HEMSA 


DBDR MACMU 


OPS 4 DROME 


NY5R RAT 




CML2 HUMAN 


TSHR MOUSE 


GLR HUMAN 


NY5R PIG 




CKR5 HUMAN 


TSHR HUMAN 


YMJC CAEEL 


NY5R MOUSE 









P?Y7 T-TTTMAM 


lonK LAINrA 


PE21 HUMAN 


NY5R HUMAN 








OPSD CORAU 


NY5R CANFA 




OPSG ASTFA 


OPSG GECGE 


OLIH HUMAN 


NTT? 9 PZ\T 




OPSD_ASTFA 


0LF2_CANFA 


YJZ3 YEAST 


NTR2 MOUSE 




0PS5 DROME 


MGR6 HUMAN 


PROA_HAEIN (FP) 


MGR3_RAT 




MGR4 HUMAN 


GPRI HUMAN 


LSHR CHICK 


MGR3 HUMAN 




ACTR PAPHA 


GPRE RAT 


FSHR CHICK 


GUSB BOVIN 




OPSD LIMBE 


NY4R HUMAN 


PI2R MOUSE 






YXX5 CAEEL 


ET3R XENLA 


PF2R MOUSE 






AAIR MOUSE 


GRHR SHEEP 


P2YR RAT 







A Third Example: the helix-turn-helix DNA binding motif 

The third example that showcases the present invention corresponds to the 
helix-turn-helix motif that mediates the binding of many regulatory proteins to regulatory 
control sites of DNA. This 20 amino-acid long structural motif consists of two helices (7 
and 9 a.a. respectively) that are separated by a 4 amino acid turn and are held together 
through non-polar interactions of their side chains. It has been argued that 
sequence-based analysis using traditional approaches cannot unambiguously identify 
helix-turn-helix motifs unless it is combined with the use of stereo-chemical constraints. 
More recently, a pattern-based approach started with 91 carefully-selected, aligned 
sequence fragments that corresponded to known helix-turn-helix instances and produced 
significant results by essentially estimating a pattern-based profile for the helix-turn-helix 
binding motif. This set of 91 fragments is particularly interesting because it is a very 
diverse collection of helix-turn-helix motif instances that share very little at the sequence 
level. 

In the experiment carried out, a subset of 70 fragments from the set of 91 
was selected (excluding those helix-turn-helix instances that corresponded to 
pieces of homeoboxes) and no alignment information was assumed. Additionally, each 
of the fragments was extended to the left and to the right by including an additional 10 
amino acids, thus producing fragments that were 40 amino acids long. Again, the 
patterns were discovered assuming the equivalence classes {A, G}, C, {D, E}, {F, Y}, H, 
{I, L, M, V}, {K, R}, {N, Q}, P, {S, T}, W. The Teiresias parameters were set to L=5, 
W=10 whereas the successive threshold choices were K=70/5=14, K=3 and K=2. The 
goal was to discover patterns that involved at least 5 non-wild cards in any rolling window 
spanning 10 positions that begins/ends with a literal, a relatively high degree of local 
similarity (i.e., 50% or higher). From the discovered set, those patterns whose estimated 
log-probability was equal to -30.0 or less were selected, thus giving rise to a composite 
descriptor with 517 patterns. Table 7 below lists the labels of the 70 fragments in this 
training set. Table 7 shows Swiss-Prot labels of the 70 sequence fragments with length 
40 a.a. in the training set for the helix-turn-helix experiment. 



Table 7 

1 through 20 | 21 through 40 | 41 through 60 | 61 through 70 

BIRA_ECOLI | TNP0_ECOLI | TNP3_ECOLI | RCRO_BPP22 
CYTR_ECOLI | DNIV_BPP1 | DNIV_ECOLI | VG30_BPPH8 
RBTR_KLEAE | VPB_BPMU | DNIV_SALTY | RPC_BPPH1 
ASNC_ECOLI | LACI_ECOLI | RCRO_LAMBD | DBNE_BPMU 
CRP_ECOLI | PURR_ECOLI | RPC2_LAMBD | DBNE_BPD10 
ARAC_ERWCH | DEOR_ECOLI | RCRO_BP434 | RP32_ECOLI 
ADA_ECOLI | ARAC_ECOLI | RPC1_BPP22 | RPSF_BACSU 
DICC_ECOLI | FNR_ECOLI | RPC1_BPPH8 | RPSE_BACSU 
LYSR_ECOLI | DICA_ECOLI | RPC_BP163 | RP54_KLEPN 
ILVY_ECOLI | FIS_ECOLI | RPC_BPP2 | RP54_AZOVI 
TRPI_PSEAE | METR_SALTY | VPC_BPMU | 
NOD2_RHIME | AMPR_ENTCL | RPSD_BUCAP | 
XYLR_BACSU | NOD1_RHIME | RPSA_BACSU | 
NIFA_RHIME | XYLS_PSEPU | RPSB_BACSU | 
NTRC_RHIME | NIFA_KLEPN | RP54_RHIME | 
MERR_STAAU | NTRC_KLEPN | PARB_ECOLI | 
NAHR_PSEPU | MERR_BACSR | SOPB_ECOLI | 
TER2_ECOLI | MERR_PSEAE | RPC1_LAMBD | 
TNP2_ECOLI | TER3_ECOLI | RPC1_BP434 | 
TNP1_ECOLI | TNP5_PSEAE | RPC2_BPP22 | 
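
The flanking step used to build the 40-residue training fragments can be sketched as follows. The function name `extend_fragment`, the 0-based `start` convention, and the clamping at the sequence ends are assumptions of this sketch; the text does not say how motif instances close to a terminus were handled.

```python
def extend_fragment(sequence, start, motif_length=20, flank=10):
    """Return the motif beginning at `start` (0-based) together with `flank`
    residues on each side.

    Clamping at the sequence ends is an assumption of this sketch.
    """
    left = max(0, start - flank)
    right = min(len(sequence), start + motif_length + flank)
    return sequence[left:right]

# Toy sequence: a 20-residue "motif" of H's surrounded by 15 residues per side.
toy = "A" * 15 + "H" * 20 + "B" * 15
fragment = extend_fragment(toy, 15)
```

A 20-residue motif with 10 flanking residues on each side gives the 40-residue fragments used as the training set, provided the motif is not within 10 residues of either end of its parent sequence.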





The resulting DFA (deterministic finite automaton, which will only 
recognize instances of the composite descriptor patterns in a query sequence and which 
performs method step 260) was used to search the randomized version RAND-Swiss-Prot 
of Swiss-Prot (Release 38.0), and a total of 277 randomized sequences were found to 
receive non-zero support. Of the 277 randomized sequences, 275 
received a support value that was less than or equal to 6. Thus, Thres_rand was set equal to 
7. This threshold choice corresponded to the 99.2-th percentile. Fig. 4 shows the 
histogram of the scores for the sequences of RAND-Swiss-Prot that received non-zero 
support. 

A subsequent search of the actual Swiss-Prot database gave rise to 193 
sequences that received support greater than or equal to Thres_rand = 7. The support values 
ranged from the minimum allowed value of 7 to a maximum value of 66. 

Next, the Swiss-Prot annotation (feature table "FT" lines and description 
"DE" lines) was used for each of these 193 sequences. Of these, 169 are actually listed in 
Swiss-Prot as containing a helix-turn-helix motif, 2 are listed as belonging to an H-T-H 
group from PFAM (Y4WC_RHISN, Y4AM_RHISN) and 3 are listed as having 
DNA-binding properties (VR2B_BPT4) or being putative DNA replication proteins 
(Y4CK_RHISN) or being a cytosine-specific methyltransferase (MTE8_ECOLI). Of the 
remaining proteins, 1 is listed as a hypothetical protein (YP60_METTM), 1 is listed as a 
hypothetical transcription factor containing a helix-turn-helix motif (Y558_METJA), 1 is 
listed as being involved in DNA packaging (XTMA_BACSU), 1 is listed as having 
strong similarity to MJ1545, which is a putative transcription repressor protein containing 
a helix-turn-helix motif (Y014_ARCFU), 3 have very good blastp P-values with all the 
similarities confined in the helix-turn-helix region of the input fragments 
(PRPD_SALTY, PRPD_ECOLI, Y0FO_MYCTU), and finally, 2 are likely to be false 
positives (YOAE_ECOLI, CTPE_MYCTU). Table 8 below contains a listing of the 
labels of these 193 hits in order of decreasing value of accumulated support. Table 8 
shows the Swiss-Prot labels of the 193 sequence fragments that are discovered using the 
composite descriptor derived from the original set of 70 fragments. 



Table 8 

1 through 50 | 51 through 100 | 101 through 150 | 151 through 193 



RPSF BACSU 


RP54 CAUCR 


RPSD SERMA 


NIFA KLEOX 


RPSE_BACSU 


RBSR ECOLI 


RPSD SALTY 


MERR BACSR 


RPSF BACLI 


PURR_HAEIN 


RPSD PSEFL 


HIPB ECOLI 


RP3 5 BACTK 


FIS HAEIN 


RPSD PSEAE 


FIXK BRAJA 


RPSE CLOAB 


TNP2 ECOLI 


RPSD ECOLI 


CTPE MYCTU 


RPSF BACME 


RPCl BPP22 


RPSD BUCAP 


YCIT ECOLI 


RPSG CLOAB 


RP54 AZOCA 


RPCl BPD3 


RPSD NEIGO 
KPSB BACSU 


RP3 2 PROMI 


RP5 5 BRAJA 


RPOD STRPN 


RPC1 LAMBD 


RP3 2 ENTCL 


RP54 RHISN 


RPC BPPH1 


KPSG BACSU 


RP32 ECOLI 


RP54 RHILP 


REGL STRLI 


1LR3 ECOLI 


RP32 CITFR 


RP32 SERMA 


PRPD_SALTY 


VbJU BPPH8 


NTRC BRASR 


PURR ECOLI 


PRPD ECOLI 


T 7\ f" 1 T IT/TNT T" 

LAL1 ECOLI 


NTRC AZOCA 


NTRC RHOCA 


NODD RHILE 


IhRl ECOLI 


NODI RHIGA 


MALI ECOLI 


NOD 3 RHIME 


KPL BP163 


NAHR PSEPU 


EBGR ECOLI 


N0D2 RHISN 


KB IK KLEAE 


GALR SALTY 


CSCR ECOLI 


N0D2 BRAJA 


VTl O D RAT TIT T 


GALR ECOLI 


CRP HAEIN 


MALR STRCO 


KJJbA erwca 


FIS ECOLI 


Y014 ARCFU 


HRDA STRCO 


KPC2 BPP22 


FADR HAEIN 


XTMA BACSU 


HMOS CAEEL 


HLYX ACTPL 


FX24 RHILV 


SCRR PEDPE 


YOAE ECOLI 


FNR SALTY 


ENDR PAEPO 


RPC2 LAMBD 


Y4CK RHISN 


FMR HAEIN 


YCJW ECOLI 


RPC2 BP434 


YOFO MYCTU 


FNR ECOLI 


TNP7 ECOLI 


RP54 SALTY 


TYRR HAEIN 


ETRA SHEPU 


TNP5 PSEAE 


RP54 KLEPN 


TYRR ECOLI 


H nO T"N T T TV T*~l TIT 

RPSD HAEIN 


NTRC AZOBR 


RP54 ECOLI 


TRA6 PSEAE 


RP54 PSEPU 


MTE8 ECOLI 


RP54 BRAJA 


RPSD PSEPU 


T") a non ti n 

KPb4 PSEAE 


CRP SALTY 


N0D2 BRAEL 


RPSD CAUCR 


n n r ^ 7\ rr at t t 

KPb4 AZOVI 


CRP ECOLI 


NODI RHISN 


RPSD BACSU 


lEKz ECOLI 


ASCG ECOLI 


N0D1_BRASN 


RP54 THIFE 


RPSD STAAU 


ADA ECOLI 


NODI BRAJA 


N0D2 RHILP 


RPSD LEPIN 


RP54 ALCEU 


MALR STRPN 


NIFA RHOCA 


RPSD ENTFA 


GALS ECOLI 


MALI VIBFU 


NIFA ENTAG 


RPSA BACSU 


SCRR VIBAL 


GNTR ECOLI 


NIFA AZOVI 


DEOR ECOLI 


RP55 RHIME 


DBNE BPD10 


NIFA AZOCH 


BIRA SALTY 


RP54 RHIME 


Y5 5 8 METJA 


MERR STAAU 


BIRA ECOLI 


RP2 8 BACTK 


Y4WC RHISN 


ILVY SALTY 


YP6 0 METTM 


REGA CLOAB 


Y4AM RHISN 


ILVY ECOLI 


PARB ECOLI 


NODD BRASP 


Y2 72 METJA 


FECI ECOLI 


NTRC RHIME 


CCPA STRMU 


TRPI PSESY 


CYTR ECOLI 


NOD 2 RHIME 


ASNC ECOLI 


TRPI PSEAE 


BTR BORPE 


NODI RHIME 


SCRR STAXY 


RPSD STRAU 


ARAC ERWCH 


TER8 PASMU 


RPSK BACSU 


RP54 VIBAN 


AMPR ENTCL 


RPSD LISMO 




KPb4 ACICA 


AMPR CITFR 


RCRO BPP22 


RBSR BACSU 


RCRO LAMBD 




FNRL RHOSH 


KDGR BACSU 


RCRO BP434 




FIXK RHIME 


DEGA BACSU 


RAFR ECOLI 




FIXK AZOCA 


ASNC HAEIN 


NODD RHILV 




TER8 PASPI 


VR2B BPT4 


NODD RHILT 




TER4 ECOLI 


VPB BPMU 


NODI BRAEL 




RPC1 BP434 


SCRR STRMU 


NIFA KLEPN 





Starting now with the set of all 193 discovered sequence fragments, one 
more iteration of the described method was carried out using this set as the new training 
set, T. The training set for this iteration was formed by collecting the individual sequence 
fragments whose support exceeded threshold. As before, the Teiresias parameters were 
set to L=5 and W=10 whereas the successive threshold choices were K=193/5=38, K=7 
and K=2. Sub-selecting those patterns whose estimated log-probability was equal to 
-30.0 or less produced 1,061 patterns, which were added to the previous set of 517 to form 
a new augmented composite descriptor. The DFA resulting from the latter descriptor was 
applied to RAND-Swiss-Prot. Of the 537 sequence fragments that received non-zero 
support, 534 received support 9 or less, thus establishing the value 10 as the new Thres_rand 
(=99.2-th percentile). Processing Swiss-Prot with this last DFA, an additional 96 
sequence fragments were discovered that exceeded threshold, for a grand total of 289 
fragments. Table 9 below lists the labels for this additional set of fragments. Table 9 
shows the Swiss-Prot labels of the additional 96 sequence fragments that are discovered 
after augmenting the original composite descriptor with the patterns that are discovered 
from treating the first set of 193 discovered fragments as a training set. 
Table 9 

1 through 25 | 26 through 50 | 51 through 75 | 76 through 96 

HRDB_STRCO | CCPA_BACSU | FNRA_PSEST | VMEM_PVSP 
EMRD_ECOLI | CCPA_BACME | ANR_PSEAE | V57A_BPT4 
HRDD_STRCO | YJGS_ECOLI | YH93_ARCFU | SP3D_BACSU 
RPSD_LACLA | RP32_VIBCH | RPOS_VIBCH | RPC_BPP2 
RPSD_SYNP7 | RBSR_HAEIN | RPOS_PSEAE | MALR_STAXY 
RPSD_MICAE | YFED_ECOLI | YFER_ECOLI | EBSC_ENTFA 
RPSD_ANASP | RPSD_RICPR | Y701_SYNY3 | VG36_BPML5 
RPSD_AGRTU | RPSD_BORBU | Y4BA_RHISN | VG36_BPMD2 
YOlW_MYCTU | RPSD_HELPY | RP32_PSEAE | PRPR_SALTY 
RPSD_CHLTR | YYAA_BACSU | FRVR_ECOLI | MERB_SERMA 
RPSD_MYXXA | RP54_XANCV | ARAC_SALTY | BRPA_STRHY 
RPSD_TREPA | SACR_LACLA | YG27_ARCFU | ARAC_ECOLI 
RPSD_RHOCA | NIFA_RHISN | XYLR_BACSU | ARAC_CITFR 
YVDE_BACSU | NIFA_RHIET | RPSC_ANASP | ACOR_ALCEU 
RPSW_STRCO | NIFA_BRAJA | NADR_KLEPN | YYAG_BACSU 
Y151_METJA | NFXB_PSEAE | YRDX_RHOSH | YSCC_YEREN 
RPOS_YEREN | TRA6_BACST | YAHB_ECOLI | XYS4_PSEPU 
RPOS_SHIFL | RP54_BACSU | TRA4_BACFR | XYS1_PSEPU 
RPOS_SALTY | ACRR_ECOLI | RPSC_SYNY3 | XYLS_PSEPU 
RPOS_SALTI | YFET_ECOLI | RP32_CAUCR | THCR_RHOSN 
RPOS_SALDU | RP54_TREPA | NIFA_AZOLI | TETP_CLOPE 
RPOS_ECOLI | EXPR_ERWCH | NIFA_AZOBR | 
PEPR_LACDL | ECHR_ERWCH | MLTD_ECOLI | 
GALR_HAEIN | SORC_KLEPN | AADR_RHOPA | 
CCPA_STAXY | RP54_RHOCA | YDT6_SCHPO | 





15 



An analysis of the additional hits using the feature tables in Swiss-Prot
showed that 81 of those are true positives, 4 are listed as DNA binding
(TRA4_BACFR, V57A_BPT4, NADR_KLEPN) or transcription regulation proteins
(EBSC_ENTFA), and 2 are listed as hypothetical proteins (YFED_ECOLI,
YDT6_SCHPO). Finally, 8 hits probably correspond to false positives (Y4BA_RHISN,
EMRD_ECOLI, VG36_BPMD2, VG36_BPML5, TETP_CLOPE, YSCC_YEREN,
MERB_SERMA, MLTD_ECOLI).







A Fourth Example: Searching the C. elegans Genome for EF1G, GPCR and HTH
Candidates

The three composite descriptors were used to search the collection of
19,099 ORFs that were reported for the C. elegans genome (see:
http://genome.wustl.edu/gsc/C_elegans) as of June 13, 1999. In all three cases, the
corresponding values of Thres_rand that were established by searching RAND-Swiss-Prot
were used.

Elongation Factor 1 Gamma Chain 



First, this ORF collection was searched using the 2,260 pattern composite
descriptor that was built for the elongation factor 1 gamma chain (PS50040 above). Of the
13 ORFs that received non-zero support, only one, F17C11.9, exceeded threshold. This
ORF is the one listed in Swiss-Prot (and in PS50040) as EF1G_CAEEL.
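
The search step described above can be illustrated with the following minimal sketch. It assumes that the support of a residue is the number of pattern occurrences covering it, that patterns use "." as a wildcard (so they can be applied directly as regular expressions), and that an ORF is reported when its largest support reaches the threshold. The function names, the toy ORFs, and the toy patterns are hypothetical; a regular-expression scan is substituted here for the DFA of the described method.

```python
import re
from typing import Dict, List


def position_support(sequence: str, patterns: List[str]) -> List[int]:
    """Count, for every residue, how many pattern occurrences cover it."""
    support = [0] * len(sequence)
    for pattern in patterns:
        # A zero-width lookahead lets overlapping occurrences be found.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            start = match.start(1)
            for i in range(start, start + len(match.group(1))):
                support[i] += 1
    return support


def orfs_above_threshold(orfs: Dict[str, str], patterns: List[str],
                         threshold: int) -> List[str]:
    """Return labels of ORFs whose maximum support is at least `threshold`."""
    hits = []
    for label, sequence in orfs.items():
        support = position_support(sequence, patterns)
        if support and max(support) >= threshold:
            hits.append(label)
    return hits


# Toy data (hypothetical labels, sequences and patterns).
toy_orfs = {"orfA": "MKTAYIAKQR", "orfB": "GGGGGGGG"}
toy_patterns = ["K.A", "A.I", "TAY"]
toy_hits = orfs_above_threshold(toy_orfs, toy_patterns, threshold=3)
```

In this toy case only orfA is reported, because three pattern occurrences overlap on one of its residues while orfB matches no pattern at all.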



G-protein Coupled Receptors 

Next, the C. elegans genome was searched using the composite descriptor
for the G-protein coupled receptor that comprised 1,703 patterns. Note that this
particular experiment did not set out to discover and enumerate all putative G-protein
coupled receptors in C. elegans, but rather to show that, even when starting with a small
knowledge base that contains no GPCR sequences from the genome under consideration,
it can be effective to mine a complete genome such as C. elegans.

In Table 10 below, the labels of the 101 C. elegans ORFs whose support
exceeded threshold are shown. For each of those ORFs, the Score and the P and N values
are shown for the top scoring sequence obtained from running a BLASTP search against
the set of 804 Swiss-Prot Rel. 35.0 sequences that are known to be true GPCRs (see also
discussion above). Table 10 shows the 101 ORFs from C. elegans that were discovered
using a composite descriptor for the GPCR family and whose support exceeds threshold.
For each of the reported ORFs, also listed is the top scoring sequence from running
BLASTP against the set of 804 Swiss-Prot Rel. 35.0 sequences that are known to be true
GPCRs.



Table 10

#   | C. elegans ORF | Top Scoring       | Score | P         | N
    | Label          | Training Set Seq. |       |           |
----+----------------+-------------------+-------+-----------+---
1   | M03F4.3        | 5H1A_MOUSE        | 190   | 2.300E-73 | 6
2   | K09G1.4        | D3DR_RAT          | 272   | 4.100E-79 | 6
3   | K02F2.6        | OAR_DROME         | 235   | 1.200E-59 | 5
4   | F14D12.6       | 5H1A_MOUSE        | 214   | 7.300E-77 | 6
5   | C09B7.1        | 5H1A_MOUSE        | 265   | 1.000E-61 | 4
6   | C02D4.2        | OAR_DROME         | 292   | 3.100E-11 | 5
7   | ZK455.3        | GRPR_MOUSE        | 181   | 6.600E-38 | 5
8   | C52B11.3       | 5H1A_MOUSE        | 232   | 6.200E-64 | 5
9   | F15A8.5        | DOP1_DROME        | 300   | 6.400E-85 | 3
10  | F16D3.7        | 5H1A_MOUSE        | 202   | 5.400E-44 | 4
11  | F59C12.2       | 5H2A_CRIGR        | 221   | 5.600E-65 | 4
12  | T14E8.3        | D3DR_RAT          | 231   | 4.400E-51 | 4
13  | T02E9.3        | D3DR_RAT          | 190   | 1.600E-43 | 5
14  | F01E11.5       | OAR_DROME         | 293   | 1.400E-77 | 3
15  | C53C7.1        | NY4R_MOUSE        | 177   | 2.800E-36 | 4
16  | C30F12.6       | SSR4_RAT          | 119   | 9.600E-30 | 4
17  | Y40H4A.a       | ACM3_PIG          | 485   | 5.100E-83 | 2
18  | F41E7.3        | NK2R_RAT          | 148   | 3.600E-39 | 5
19  | C38C10.1       | NK1R_RANCA        | 232   | 1.800E-59 | 4
20  | 2C412.1        | NY4R_MOUSE        | 176   | 2.500E-30 | 4
21  | C26F1.6        | OPSB_ANOCA        | 71    | 5.200E-10 | 3
22  | C39B10.1       | AG2R_HUMAN        | 68    | 1.800E-04 | 2
23  | C24A8.1        | D3DR_RAT          | 147   | 8.500E-40 | 5
24  | T07D4.1        | NK1R_RANCA        | 114   | 2.900E-27 | 5
25  | F55E10.7       | OPRX_PIG          | 113   | 1.400E-15 | 3
26  | C16D6.2        | NY4R_MOUSE        | 180   | 6.400E-32 | 4
27  | C10C6.2        | NY4R_MOUSE        | 170   | 4.500E-31 | 3
28  | L15B12.5       | ACM1_RAT          | 170   | 7.700E-50 | 7
29  | t 4 7D12.2     | ACM3_CHICK        | 178   | 2.000E-49 | 4
30  | 1 z 3L6.5      | GRPR_MOUSE        | 102   | 1.600E-25 | 5
31  | WQ5B5.2        | GRPR_MOUSE        | 143   | 2.400E-31 | 7
32  | I 2 7D1.3      | NK1R_RAT          | 90    | 3.000E-25 | 5
33  | C49A9.7        | NK1R_RAT          | 318   | 1.400E-65 | 2
34  | AH9.1          | OPSB_GECGE        | 62    | 6.600E-07 | 3
35  | B0563.6        | ACM1_RAT          | 94    | 3.500E-09 | 3
36  | R106.2         | SSR4_RAT          | 126   | 2.200E-43 | 5
37  | M01E10.1       | IL8A_RAT          | 134   | 2.100E-14 | 1
38  | T07D10.2       | V1BR_HUMAN        | 140   | 2.300E-33 | 6
39  | F54D7.3        | GRHR_HUMAN        | 205   | 3.300E-42 | 4
40  | C50F7.1        | NK1R_RAT          | 183   | 2.000E-35 | 5
41  | R13H7.2        | 5H1A_MOUSE        | 67    | 1.600E-04 | 2
42  | Yb4E2A.1       | SSR4_RAT          | 118   | 1.500E-39 | 5
43  | T07F8.2        | OPRX_PIG          | 74    | 4.000E-11 | 4
44  | C39E6.6        | NY4R_MOUSE        | 232   | 9.400E-43 | 4
45  | K10B4.4        | SSR4_RAT          | 97    | 6.400E-30 | 4
46  | T05A1.1        | NY4R_MOUSE        | 218   | 1.700E-30 | 2
47  | T02E9.1        | GPRO_HUMAN        | 106   | 1.600E-23 | 7
48  | F42C5.2        | SSR4_RAT          | 109   | 1.500E-23 | 5
49  | F35G8.1        | NY4R_MOUSE        | 136   | 5.800E-32 | 4
50  | K03H6.1        | OPRX_PIG          | 51    | 5.900E-07 | 5
51  | F47D12.1       | ACM3_CHICK        | 104   | 1.100E-15 | 3
52  | C56G3.1        | AG2R_HUMAN        | 188   | 1.400E-26 | 3
53  | T23B3.4        | 5H1A_MOUSE        | 84    | 1.800E-24 | 5
54  | AC7.1          | NK1R_RAT          | 195   | 4.400E-46 | 3
55  | C51E3.1        | OLF5_RAT          | 83    | 1.800E-08 | 3
56  | C25G6.5        | NY4R_MOUSE        | 208   | 4.600E-42 | 4
57  | C18B10.4       | PAFR_CAVPO        | 51    | 6.100E-02 | 2
58  | C24A8.4        | 5HTB_DROME        | 102   | 1.500E-14 | 2
59  | T19B10.10      | NK2R_RAT          | 84    | 1.000E-07 | 4
60  | F14F4.1        | V1BR_HUMAN        | 109   | 2.100E-25 | 5
61  | T02D1.6        | GPRO_HUMAN        | 103   | 9.800E-18 | 4
62  | C51E3.2        | OPSD_CATBO        | 80    | 4.200E-07 | 2
63  | Y59E9.118.b    | AA3R_HUMAN        | 60    | 9.200E-04 | 1
64  | H02I12.3       | AA1R_CHICK        | 86    | 6.300E-13 | 4
65  | F56B6.5        | SSR4_RAT          | 119   | 4.500E-35 | 5
66  | Y116A8B.5      | SSR4_RAT          | 113   | 7.100E-23 | 5
67  | C44B7.6        | GRHR_HUMAN        | 38    | 1.800E-01 | 3
68  | Y24D9A.29.e    | GU03_RAT          | 51    | 4.900E-03 | 2
69  | T22D1.12       | NY4R_MOUSE        | 210   | 4.500E-41 | 4
70  | R12C12.3       | NY4R_MOUSE        | 57    | 3.000E-07 | 4
71  | F21C10.9       | SSR4_RAT          | 71    | 2.500E-10 | 4
72  | F59A1.12       | CRFR_CHICK        | 42    | 3.600E-01 | 3
73  | F40A3.7        | AS 1 D L-AVPU     | 47    | 1.300E-03 | 3
74  | T19F4.1        | ML1C_CHICK        | 62    | 6.600E-09 | 4
75  | F59B2.13       | SSR4_RAT          | 71    | 1.900E-05 | 3
76  | F54E4.1        | CASR_HUMAN        | 51    | 9.000E-01 | 1
77  | F54D1.5        | CASR_HUMAN        | 49    | 8.200E-01 | 1
78  | C54A12.2       | OPSB_ANOCA        | 58    | 7.400E-08 | 3
79  | C53A5.12       | ACM3_RAT          | 275   | 1.000E-33 | 1
80  | Y58G8A.208.a   | NY4R_MOUSE        | 186   | 4.600E-38 | 4
81  | Y105C5.v       | EBI2_HUMAN        | 129   | 9.700E-17 | 2
82  | T25B6.2        | NK1R_RAT          | 53    | 1.200E-01 | 2
83  | I 22G5.4       | GRPR_MOUSE        | 70    | 4.300E-09 | 3
84  | F31B9.1        | NK1R_RANCA        | 106   | 1.000E-27 | 5
85  | C03G6.13       | ACM2_HUMAN        | 46    | 1.400E-01 | 3
86  | H09F14.1       | SSR4_RAT          | 73    | 2.400E-12 | 4
87  | C45H4.3        | AG2R_HUMAN        | 46    | 1.000E-01 | 2
88  | Y71G12A.199.   | VIPR_MELGA        | 51    | 4.100E-02 | 2
89  | T23H2.3        | DOP1_DROME        | 55    | 2.200E-01 | 1
90  | F59A7.8        | NK1R_RANCA        | 52    | 2.300E-01 | 1
91  | F55D10.4       | MC4R_RAT          | 67    | 6.800E-07 | 4
92  | F21G4.2        | OAR_DROME         | 53    | 3.400E-01 | 2
93  | L15H11.2       | V1BR_HUMAN        | 89    | 5.900E-20 | 5
94  | C10F3.3        | [missing]         | [illegible] | 1.200E-02 | 4
95  | C06B3.11       | B3AR_MOUSE        | 53    | 1.700E-04 | 3
96  | Y77E11A.3443   | MSHR_HUMAN        | 59    | 1.000E-02 | 1
97  | K09C6.5        | MC4R_RAT          | 41    | 5.200E-02 | 3
98  | Y41D4B.3805.   | GRHR_HUMAN        | 70    | 1.600E-04 | 1
99  | Y40H7.d        | OAR_DROME         | 37    | 9.700E-01 | 1
100 | Y116F11.ZZ8    | SSR4_RAT          | 53    | 4.800E-02 | 2
101 | T26E4.15       | AG2R_HUMAN        | 79    | 2.400E-09 | 3



In addition to the above 101 C. elegans ORFs that exceeded threshold, and
as testimony to the stringent thresholds used, Table 11 below lists an
additional 19 ORFs whose scores were just below threshold and which generated
blast-search P values that were significant. As before, the blast searches were carried out
against the set of 804 Swiss-Prot Rel. 35.0 known true GPCRs. Table 11 shows an
additional 19 ORFs from C. elegans that receive scores just below threshold but show
significant blast-search P values when compared against the set of 804 true GPCRs from
Rel. 35.0 of Swiss-Prot.



Table 11

#   | C. elegans ORF | Top Scoring       | Score | P         | N
    | Label          | Training Set Seq. |       |           |
----+----------------+-------------------+-------+-----------+---
102 | T26E4.14       | AG2R_HUMAN        | 73    | 8.10E-08  | 2
103 | M01B2.7        | NK1R_RANCA        | 60    | 4.10E-09  | 4
104 | K03H6.5        | SSR4_RAT          | 52    | 6.50E-05  | 4
105 | F58D7.1        | SSR4_RAT          | 55    | 9.50E-06  | 4
106 | F57H12.4       | D3DR_RAT          | 80    | 9.40E-10  | 3
107 | F53A9.5        | AA1R_CAVPO        | 78    | 5.50E-06  | 1
109 | F02E8.2        | SSR4_RAT          | 90    | 6.50E-21  | 5
110 | C51E3.4        | OPSD_CATBO        | 83    | 3.10E-06  | 1
111 | C02H7.2        | 5H2A_CRIGR        | 61    | 8.00E-09  | 4
113 | Y34D9A.152.d   | NY4R_MOUSE        | 56    | 2.10E-08  | 4
114 | K06B4.9        | ML1X_HUMAN        | 73    | 2.60E-06  | 2
118 | [illegible]    | GPCR_LYMST        | 127   | 2.70E-11  | 3
119 | F35F10.2       | PAFR_CAVPO        | 51    | 8.50E-04  | 3
124 | C06G4.5        | OPRX_PIG          | 164   | 3.50E-23  | 4
128 | Y57A10C.8      | OLF9_RAT          | 65    | 2.80E-04  | 2
132 | Y4C6A.h        | MGR8_HUMAN        | 347   | 2.30E-215 | 1
134 | R07B5.5        | A2AA_PIG          | 53    | 1.40E-04  | 3
135 | M01B2.9        | AG2R_HUMAN        | 73    | 1.50E-07  | 2
140 | C51E3.3        | AA1R_CHICK        | 73    | 5.70E-05  | 1



Several comments are in order here. First, it should be stressed that the
above analysis does not imply that there are only 120 G-protein coupled receptors in C.
elegans. Instead, what it attempts to demonstrate is that, even if one begins with a
small knowledge base of only 80 known GPCRs that have been selected randomly, one
can still build a useful composite descriptor for the family and use it to explore a
largely-unexplored genome such as C. elegans. In order to have a complete enumeration
of the GPCRs that are present in C. elegans, the composite descriptor should be built by
using all of the GPCRs that are present in GPCRDB and not only 80 of them. Second,
the BLAST searches were run against the set of 804 sequences in order to show
the ability of the proposed method to extrapolate. As such, blast-search results with P
values that are relatively high (e.g., E-02) should not be surprising, since the target
database of 804 true GPCRs is but a small fraction of the current contents of GPCRDB.
Indeed, the November 1999 release of GPCRDB contained 1,704 GPCR sequences and
431 GPCR sequence fragments, for a grand total of 2,135 entries.
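
As an illustration of how the tabulated P values might be inspected, the following minimal sketch separates ORFs by whether their blast-search P value falls at or below a cutoff. The cutoff of 1e-3 and the function name are hypothetical choices made only for this sketch; the four example rows are taken from Table 10.

```python
from typing import List, Tuple


def split_by_p_value(rows: List[Tuple[str, str]],
                     cutoff: float) -> Tuple[List[str], List[str]]:
    """Split ORF labels into those with P <= cutoff and those with P > cutoff.

    P values are given as strings in the tabulated form, e.g. "2.300E-73".
    """
    at_or_below = [label for label, p in rows if float(p) <= cutoff]
    above = [label for label, p in rows if float(p) > cutoff]
    return at_or_below, above


# Four rows from Table 10; the cutoff is an assumption for illustration.
example_rows = [
    ("M03F4.3", "2.300E-73"),
    ("T26E4.15", "2.400E-09"),
    ("C18B10.4", "6.100E-02"),
    ("Y40H7.d", "9.700E-01"),
]
low_p, high_p = split_by_p_value(example_rows, cutoff=1e-3)
```

A relatively high P value in such a split reflects only the limited size of the 804-sequence target set, as noted above, and would not by itself rule out the ORF as a GPCR candidate.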



Helix-Turn-Helix 

Finally, the 19,099 ORFs of C. elegans were searched for instances of the
helix-turn-helix binding motif using the corresponding 2,288 (=1,896+392) pattern
composite descriptor. Of the 169 sequences that received non-zero support, only 5
exceeded threshold: Y94H6A_142.g (in the region delineated by a.a. 65 through 95),
C16C2.1 (in the region delineated by a.a. 59 through 89), F18C5.2 (in the region
delineated by a.a. 850 through 880), Y39F10A.a (in the region delineated by a.a. 125
through 155), and Y48C3A.S (in the region delineated by a.a. 113 through 143).
The fragments were:

>Y94H6A_142.g fragment
IFDNTNDLVASLLGISSITVYRtffeKRIGEE
>C16C2.1 fragment
YLSGSTRAKLAESLGLSDNQWVWFQNRRT
>F18C5.2 fragment
ISRSTAKEVATARGISEGTOYSYLAMAVEK
>Y39F10A.a fragment
LSAYTISDLAKHFNVSMEILKIDIEGAEL
>Y48C3A.s fragment
NEVLNLMEVAKELNLSKRRVYDVINVLEGL



and their respective top-scoring sequences from the training set of 70 helix-turn-helix
segments, blast scores, and P and N values are:



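#   | C. elegans ORF | Top Scoring       | Score | P         | N
----+----------------+-------------------+-------+-----------+---
1   | Y94H6A_142.g   | RPSF_BACSU        | 50    | 2.80E-06  |
2   | C16C2.1        | TER3_ECOLI        | 45    | 1.30E-05  |
3   | F18C5.2        | VBP_BPMU          | 47    | 9.30E-06  |
4   | Y39F10A.a      | TNPO_ECOLI        | 39    | 1.10E-04  |
5   | Y48C3A.S       | TNP1_ECOLI        | 49    | 6.40E-06  |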
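
For illustration, a region "delineated by a.a. i through j" can be cut out of an ORF as in the following minimal sketch. It assumes that the coordinates are 1-based and inclusive; that convention, the function name, and the placeholder sequence are assumptions of this sketch and are not stated in the text.

```python
def region(sequence: str, first: int, last: int) -> str:
    """Return residues `first` through `last` of `sequence`.

    Coordinates are taken to be 1-based and inclusive (an assumption).
    """
    if not 1 <= first <= last <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    return sequence[first - 1:last]


# Placeholder sequence (hypothetical), used only to show the indexing.
placeholder = "ABCDEFGHIJ"
piece = region(placeholder, 3, 6)
```

Under this reading, a region such as a.a. 65 through 95 would span 31 residues.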





It is to be understood that the embodiments and variations shown and 
described herein are merely illustrative of the principles of this invention and that various 
modifications may be implemented by those skilled in the art without departing from the 
scope and spirit of the invention. 
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